Abstract

Kolmogorov argued that the concept of information exists also in problems with no
underlying stochastic model (of the kind that Shannon's representation of information
assumes), for instance, the information contained in an algorithm or in the genome. He
introduced a combinatorial notion of entropy and information I(x : y) conveyed by a binary
string x about the unknown value of a variable y. The current paper poses the following
questions: what is the relationship between the information conveyed by x about y and the
description complexity of x? Is there a notion of cost of information? Are there limits on
how efficiently x conveys information? To answer these questions, Kolmogorov's definition
is extended and a new concept termed information width, which is similar to n-widths
in approximation theory, is introduced. Information of any input source, e.g., sample-based,
general side-information or a hybrid of both, can be evaluated by a single common
formula. An application to the space of binary functions is considered.

1 Introduction

Kolmogorov [13] sought a measure of information of 'finite objects'. He considered three
approaches, the so-called combinatorial, probabilistic and algorithmic. The probabilistic
approach corresponds to the well-established definition of the Shannon entropy which applies

to stochastic settings where an 'object' is represented by a random variable. In this setting, 
the entropy of an object and the information conveyed by one object about another are well 
defined. Here it is necessary to view an object (or a finite binary string) as a realization of a 
stochastic process. While this has often been used, for instance, to measure the information 
of English texts E] by assuming some finite-order Markov process, it is not obvious 
that such modeling of finite objects provides a natural and a universal representation of 
information. As Kolmogorov states in [13]: What real meaning is there, for example, in
asking how much information is contained in (the book) "War and Peace" ? Is it reasonable 
to ... postulate some probability distribution for this set ? Or, on the other hand, must 
we assume that the individual scenes in this book form a random sequence with stochastic 
relations that damp out quite rapidly over a distance of several pages ? These questions 
led Kolmogorov to introduce an alternate non-probabilistic and algorithmic notion of the
information contained in a finite binary string. He defined it as the length of the
minimal-size program that can compute the string. This was later developed into the
so-called Kolmogorov Complexity field [15].

In the combinatorial approach, Kolmogorov investigated another non stochastic measure 
of information for an object y. Here y is taken to be any element in a finite space Y of 
objects. In [13] he defines the 'entropy' of Y as H(Y) = log |Y| where |Y| denotes the 
cardinality of Y and all logarithms henceforth are taken with respect to 2. As he writes, if 
the value of Y is known to be Y = {y} then this much entropy is 'eliminated' by providing 
log |Y| bits of 'information'. 

Let $R = X \times Y$ be a general finite domain and consider a set

$A \subseteq R$ (1)

that consists of all 'allowed' values of pairs $(x, y) \in R$. The entropy of $Y$ is defined as

$H(Y) = \log |\Pi_Y(A)|$

where $\Pi_Y(A) = \{y \in Y : (x, y) \in A \text{ for some } x \in X\}$ denotes the projection of $A$ on $Y$.
Consider the restriction of $A$ on $Y$ based on $x$ which is defined as

$Y_x = \{y \in Y : (x, y) \in A\}, \quad x \in \Pi_X(A)$ (2)

then the conditional combinatorial entropy of $Y$ given $x$ is defined as

$H(Y|x) = \log |Y_x|$. (3)

Kolmogorov defines the information conveyed by $x$ about $Y$ by the quantity

$I(x : Y) = H(Y) - H(Y|x)$. (4)

Alternatively, we may view $I(x : Y)$ as the information that a set $Y_x$ conveys about another
set $Y$ satisfying $Y_x \subseteq Y$. In this case we let the domain be $R = \Pi_Y(A) \times \Pi_Y(A)$, $A_x \subseteq R$
is the set of permissible pairs $A_x = \{(y, y') : y \in \Pi_Y(A), y' \in Y_x\}$ and the information is
defined as

$I(Y_x : Y) = \log |\Pi_Y(A)|^2 - \log(|Y_x| |\Pi_Y(A)|)$. (5)

We will refer to this representation as Kolmogorov's information between sets. Clearly,
$I(Y_x : Y) = I(x : Y)$.
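The quantities in (1)-(4) can be computed directly for a small set of allowed pairs. The following is a minimal Python sketch; the toy set `A` and all function names are illustrative assumptions and do not come from [13].

```python
from math import log2

def projection_y(A):
    """Projection of A on Y: every y that appears in some allowed pair (x, y)."""
    return {y for (_, y) in A}

def restriction(A, x):
    """Y_x from (2): the values y that are allowed together with the given x."""
    return {y for (x_, y) in A if x_ == x}

def information(A, x):
    """Kolmogorov's I(x : Y) = H(Y) - H(Y|x) from (4), in bits."""
    h_y = log2(len(projection_y(A)))
    h_y_given_x = log2(len(restriction(A, x)))
    return h_y - h_y_given_x

# Hypothetical toy set of allowed pairs with X = {0, 1} and Y = {'a', 'b', 'c', 'd'}.
A = {(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (1, 'c'), (1, 'd')}
print(information(A, 0))  # x = 0 halves the candidate set: 1.0 bit
print(information(A, 1))  # x = 1 rules nothing out: 0.0 bits
```

In this toy case the input x = 0 restricts y to two of the four possible values and therefore conveys one bit, while x = 1 leaves all four values possible and conveys nothing.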

In many applications, knowing an input x only conveys partial information about an 
unknown value y G Y. For instance, in problems which involve the analysis of algorithms on 
discrete classes of structures, such as sets of binary vectors or functions on a finite domain, 
an algorithmic search is made for some optimal element in this set based only on partial 
information. One such paradigm is the area of statistical pattern recognition [2] where
an unknown target, i.e., a pattern classifier, is sought based on the information contained
in a finite sample and some side-information. This information is implicit in the particular 
set of classifiers that form the possible hypotheses. 

For example, let $n$ be a positive integer and consider the domain $[n] = \{1, \ldots, n\}$. Let
$F = \{0,1\}^{[n]}$ be the set of all binary functions $f : [n] \to \{0,1\}$. The power set $\mathcal{P}(F)$
represents the family of all sets $G \subseteq F$. Repeating this, we have $\mathcal{P}(\mathcal{P}(F))$ as the collection
of all properties of sets $G$, i.e., a property is a set whose elements are subsets $G$ of $F$. We
denote by $\mathcal{M}$ a property of a set $G$ and write $G \models \mathcal{M}$. Suppose that we seek to know an
unknown target function $t \in F$. Any partial information about $t$ which may be expressed
by $t \in G \models \mathcal{M}$ can effectively reduce the search space. It has been a long-standing problem
to try to quantify the value of general side-information for learning (see [M] and references
therein).

We assert that Kolmogorov's combinatorial framework may serve as a basis. We let $x$
index possible properties $\mathcal{M}$ of subsets $G \subseteq F$ and the object $y$ represent the unknown
target $t$ which may be any element of $F$. Side information is then represented by knowing
certain properties of sets that contain the target. The input $x$ conveys that $t$ is in some
subset $G$ that has a certain property $\mathcal{M}_x$.

In principle, Kolmogorov's quantity $I(x : Y)$ should serve as the value of information
in $x$ about the unknown value of $y$. However, its current form (4) is not general enough
since it requires that the target $y$ be restricted to a fixed set $Y_x$ on knowledge of $x$. To
see this, suppose $t$ is in a set that satisfies property $\mathcal{M}_x$. Consider the collection $\{G_z\}_{z \in Z_x}$
of all subsets $G_z \subseteq F$ that have this property. Clearly, $t \in \bigcup_{z \in Z_x} G_z$ hence we may first
consider $Y_x = \bigcup_{z \in Z_x} G_z$, but some useful information implicit in this collection is ignored
as we now show: consider two properties $\mathcal{M}_0$ and $\mathcal{M}_1$ with corresponding index sets $Z_{x_0}$
and $Z_{x_1}$ such that $\bigcup_{z \in Z_{x_0}} G_z = \bigcup_{z \in Z_{x_1}} G_z = F' \subseteq F$. Suppose that most of the sets $G_z$,
$z \in Z_{x_0}$ are small while the sets $G_z$, $z \in Z_{x_1}$ are large. Clearly, property $\mathcal{M}_0$ is more
informative than $\mathcal{M}_1$ since starting with knowledge that $t$ is in a set that satisfies it should
take (on average) less additional information (once the particular set $G$ becomes known) in
order to completely specify $t$. If, as above, we let $Y_{x_0} = \bigcup_{z \in Z_{x_0}} G_z$ and $Y_{x_1} = \bigcup_{z \in Z_{x_1}} G_z$
then we have $I(x_0 : Y) = I(x_1 : Y)$ which wrongly implies that both properties are equally
informative. Knowing $\mathcal{M}_0$ provides implicit information associated with the collection of
possible sets $G_z$, $z \in Z_{x_0}$. This implicit structural information cannot be represented in (4).
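The argument above can be illustrated numerically. The sketch below is in Python and uses hypothetical toy data: a domain F of 16 elements and two collections of subsets whose unions coincide. It shows that the union-based value from (4) is the same for both collections, while the average number of bits still needed once the particular set is revealed differs.

```python
from math import log2

# Hypothetical toy data: |F| = 16 and both collections have union F' = {0, ..., 7}.
SIZE_F = 16
small_sets = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]          # sets satisfying M_0
large_sets = [{0, 1, 2, 3, 4, 5}, {2, 3, 4, 5, 6, 7}]  # sets satisfying M_1

def union_information(collection, domain_size=SIZE_F):
    """I(x : Y) from (4) when Y_x is taken to be the union of the collection."""
    union = set().union(*collection)
    return log2(domain_size) - log2(len(union))

def mean_residual_bits(collection):
    """Average number of bits needed to pin down t once the set G is known."""
    return sum(log2(len(g)) for g in collection) / len(collection)

# Both properties look equally informative under the union-based measure ...
print(union_information(small_sets), union_information(large_sets))
# ... although the small sets leave far less residual uncertainty on average.
print(mean_residual_bits(small_sets), mean_residual_bits(large_sets))
```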

2 Overview 

In [23] we began to consider an extension of Kolmogorov's combinatorial information that
can be applied to more general settings. The current paper further builds upon this and
continues to explore the 'objectification' of information, viewing it as a 'static' relationship 
between sets of objects in contrast to the standard Shannon representation. As it is based on 
basic set theoretic principles no assumption is necessary concerning the underlying space of 
objects other than its finiteness. It is thus more fundamental than the standard probability- 
based representation used in information theory. It is also more general than Bayesian 
approaches, for instance in statistical pattern recognition, which assume that a target y is 
randomly drawn from Y according to a prior probability distribution. 

The two main contributions of the paper are, first, the introduction of a set-theoretic
framework of information and its efficiency, and secondly the application of this framework
to classes of binary functions. Specifically, in Section 6 we define a quantity called the
information width (Definition 6) which measures the information conveyed about an
unknown target $y$ by a maximally-informative input of a fixed description complexity $l$ (this
notion is defined in Definition 4). The first result, Theorem 1, computes this width, which
is subsequently used as a reference point for comparing the information value of
different inputs. This is done via the measures of cost and efficiency of information
defined in Definitions 5 and 7. The width serves as a universal reference against which any
type of information input may be computed and compared; for instance, the information
of a finite data sample used in a problem of learning can be compared to other kinds of
side-information.

In Section 7 we apply the framework to the space of binary functions on a finite domain.
We consider information which is conveyed via properties of classes of binary functions, 
specifically, those that relate to the complexity of learning such functions. The properties 
are stated in terms of combinatorial quantities such as the Vapnik-Chervonenkis (VC) di- 
mension. Our interest in investigating the information conveyed by such properties stems 
from the large body of work on learning binary function classes (see for instance [J|, [17], [19]).
This is part of the area called statistical learning theory that deals with computing the com- 
plexity of learning over hypotheses classes, e.g., neural networks, using various algorithms 
each with its particular type of side information (which is sometimes referred to as the 
'inductive bias', see [18]). 

For instance it is known [2] that learning an unknown target function in a class of
VC-dimension (defined in Definition 8) no greater than $d$ requires a training sample of
size linear in $d$. The knowledge that the target is in a class which has this property conveys
the useful information that there exists an algorithm which learns any target in this class
to within arbitrarily low error based on a finite sample (in general this is impossible if the
VC-dimension is infinite). But how valuable is this information? Can it be quantified and
compared to other types of side information?

The theory developed here answers this and treats the problem of computing the value 
of information in a uniform manner for any source. The generality of this approach stands 
on its set-theoretic basis. Here information, or entropy, is defined in terms of the number of 
bits that it takes to index objects in general sets. This is conveniently applicable to settings 
such as those in learning theory where the underlying structures consist of classes of ob- 
jects or inference models (hypotheses) such as binary functions (classifiers), decision trees, 
Boolean formulae, neural networks. Another area (unrelated to learning theory) which can 
be applicable is computational biology. Here a sequence of the nucleotides or amino acids 
make up DNA, RNA or protein molecules and one is interested in the amount of functional 
information needed to specify sequences with internal order or structure. Functional infor- 
mation is not a property of any one molecule, but of the ensemble of all possible sequences, 
ranked by activity. As an example, [25] considers a pile of DNA, RNA or protein molecules 
of all possible sequences, sorted by activity with the most active at the top. More infor- 
mation is required to specify molecules that carry out difficult tasks, such as high-affinity 
binding or the rapid catalysis of chemical reactions with high energy barriers, than is needed 
to specify weak binders or slow catalysts. But, as stated in [25], precisely how much more 
functional information is required to specify a given increase in activity is unknown. Our 
model may be applicable here if we let all sequences of a fixed level $a$ of activity be in the
same class that satisfies a property $\mathcal{M}_a$. Its description complexity (introduced later) may
represent the amount of functional information. 

In Section 7 we consider an application of this model and state Theorems 2 - 5 which
estimate the information value, the cost and the description complexity associated with
different properties. This allows us to compute their information efficiency and compare their
relative values. For instance, Theorem 5 considers a hybrid property which consists of two
sources of information, a sample of size $m$ and side-information about the VC-dimension $d$
of the class that contains the target. By computing the efficiency with respect to $m$ and $d$
we determine how effective each of these sources is. In Section 8 we compare the information
efficiency of several additional properties of this kind.

3 Combinatorial formulation of information 

In this section we extend the information measure (4) to one that applies to a more general
setting (as discussed in Section 1) where knowledge of $x$ may still leave some vagueness
about the possible value of $y$. As in [13] we seek a non-stochastic representation of the
information conveyed by $x$ about $y$.

Henceforth let $Y$ be a general finite domain, let $Z = \mathcal{P}(Y)$ and $X = \mathcal{P}(Z)$ where as
before for a set $E$ we denote by $\mathcal{P}(E)$ its power set. Here $Z$ represents the set of indices $z$ of
all possible sets $Y_z \subseteq Y$ and $X$ is the set of indices $x$ of all possible collections, i.e., subsets
$Z_x \subseteq Z$ of indices of sets $Y_z$, $z \in Z$. We say that a set $Y_z \subseteq Y$ has a property $\mathcal{M}_x$ if $z \in Z_x$
where $Z_x$ is the uniquely corresponding collection of $\mathcal{M}_x$.

The previous representation based on (4) is subsumed by this representation since
instead of $X$ we have $Z$ and the sets $Y_x$, $x \in \Pi_X(A)$ defined in (2) are now represented by
the sets $Y_z$, $z \in Z_x$. Since $X$ indexes all possible properties of sets in $Y$ then for any $A$ as
in (1) there exists an $x \in X$ in the new representation such that $\Pi_X(A)$ in Kolmogorov's
representation is equivalent to $Z_x$ in the new representation. Therefore what was previously
represented by the sets $\{Y_x : x \in \Pi_X(A)\}$ is now the collection of sets $\{Y_z : z \in Z_x\}$. In
this new representation a given input $x$ can point to multiple subsets $Y_z$ of $Y$, $z \in Z_x$, and
hence applies to the more general settings discussed in the previous section.

We will view the information conveyed by $x$ about an unknown object $y$ through two
perspectives. The first is held by the side that provides the information and the second by
the side which acquires it. From the side of the provider, we denote by the subset $\mathsf{y} \subseteq Y$
a set of target values $y$, for instance, solutions to a problem, any one of which the provider
may wish to convey to the acquirer. In general the provider provides partial information about
$y$ via an object $x \in X$ which is used as a means of representing this information and as
input to the acquirer.

From the acquirer's side, initially (before seeing $x$) the set of possible targets is the whole
target-domain $Y$ since he does not 'know' the subset $\mathsf{y}$. After seeing the input $x$ he then
has a collection of sets $Y_z$, $z \in Z_x$, one of which is ensured (by the provider) to intersect
the subset $\mathsf{y}$. In this case we say that $x$ is informative about $\mathsf{y}$ (Definition 1). Kolmogorov's
representation fits the acquirer's perspective where the unknown subset $\mathsf{y}$ is just the whole
target domain $Y$ (known by default) and therefore $x$ is the only variable in the information
formula of (4). In all subsequent definitions that involve $\mathsf{y}$ we may switch between the two
perspectives simply by replacing $\mathsf{y}$ with $Y$.

Definition 1 Let $\mathsf{y} \subseteq Y$ be fixed. An input object $x \in X$ is called informative for $\mathsf{y}$, denoted
$x \vdash \mathsf{y}$, if there exists a $z \in Z_x$ with a corresponding set $Y_z$ such that $\mathsf{y} \cap Y_z \neq \emptyset$.

The following is our definition of the combinatorial value of information.

Definition 2 Let $\mathsf{y} \subseteq Y$ and consider any $x \in X$ such that $x \vdash \mathsf{y}$. Define by

$I(z : \mathsf{y}) = \log(|\mathsf{y} \cup Y_z|^2) - \log(|\mathsf{y}| |Y_z|)$.

Then the information conveyed by $x$ about the unknown value $y \in \mathsf{y}$ is defined as

$I(x : \mathsf{y}) = \frac{1}{|Z_x|} \sum_{z \in Z_x} I(z : \mathsf{y})$.

Remark 1 For a fixed $x$, the information value $I(x : \mathsf{y})$ is in general dependent on $\mathsf{y}$ since
different $\mathsf{y}$ for which $x$ is informative ($x \vdash \mathsf{y}$) may have different values of $I(z : \mathsf{y})$, $z \in Z_x$.
The information value is a non-negative real number measured in bits.

Henceforth, it will be convenient to assume that $I(x : \mathsf{y}) = 0$ whenever $x$ is not informative
for $\mathsf{y}$.

Remark 2 We will refer to $I(x : \mathsf{y})$ as the provider's (or provided) information about the
unknown target $y$ given $x$. The acquirer's (or acquired) information is defined based on the
special case where $\mathsf{y} = Y$. Here we have $|\mathsf{y} \cup Y_z| = |Y \cup Y_z| = |Y|$ and the information value
becomes

$I(x : Y) = \frac{1}{|Z_x|} \sum_{z \in Z_x} [2 \log|Y| - \log|Y| - \log|Y_z|] = \log|Y| - \frac{1}{|Z_x|} \sum_{z \in Z_x} \log|Y_z|$. (6)
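As a small check of Definition 2 and of the special case (6), the following Python sketch computes the provider's information by averaging $I(z : \mathsf{y})$ over a collection and compares it with the closed form for $\mathsf{y} = Y$. The toy domain, the collection and the function names are illustrative assumptions.

```python
from math import log2

def info_z(y_set, Y_z):
    """I(z : y) = log(|y U Y_z|^2) - log(|y| |Y_z|), in bits."""
    u = len(y_set | Y_z)
    return log2(u * u) - log2(len(y_set) * len(Y_z))

def info_x(y_set, collection):
    """I(x : y): average of I(z : y) over the sets Y_z indexed by Z_x."""
    return sum(info_z(y_set, Y_z) for Y_z in collection) / len(collection)

def acquired_info(Y, collection):
    """Right side of (6): log|Y| - (1/|Z_x|) * sum over z of log|Y_z|."""
    return log2(len(Y)) - sum(log2(len(Y_z)) for Y_z in collection) / len(collection)

Y = set(range(8))                    # hypothetical target domain, |Y| = 8
collection = [{0, 1}, {2, 3, 4, 5}]  # hypothetical sets Y_z, z in Z_x

print(info_x(Y, collection))         # 1.5
print(acquired_info(Y, collection))  # 1.5
```

With $\mathsf{y} = Y$ the two computations agree, as (6) states: the first set contributes 2 bits, the second 1 bit, and the average is 1.5 bits.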

Definition 2 is consistent with (4) in that the representation of uncertainty is done as in
[13] in a set-theoretic approach since all expressions in (6) involve set-quantities such as
cardinalities and restrictions of sets. The expression of (4) is a special case of (6) with
$Z_x$ being a singleton set and $\mathsf{y} = Y$. In defining $I(z : \mathsf{y})$ we have implicitly extended
Kolmogorov's information $I(Y_x : Y)$ between sets $Y_x$ and $Y$ that satisfy $Y_x \subseteq Y$ (see (5))
into the more general definition where one of the two sets is not necessarily contained in the
other and neither one equals the whole space $Y$, i.e., $I(z : \mathsf{y}) = I(Y_z : \mathsf{y})$ is the information
between the sets $Y_z$ and $\mathsf{y}$ where $Y_z$ is not necessarily contained in $\mathsf{y}$. Here we take the
underlying two-dimensional domain as $R = (\mathsf{y} \cup Y_z) \times (\mathsf{y} \cup Y_z)$ and the set of permissible
pairs as

$A_{z,\mathsf{y}} = \{(y, y') : y \in \mathsf{y}, y' \in Y_z\} \subseteq R$.

We may view the relationship between the provider and acquirer as a transformation
between sets

$\mathsf{y} \to \{Y_z : z \in Z_x\} \leftarrow Y$

where the provider, knowing the set $\mathsf{y}$, chooses some $x$ with which he represents the unknown
value of $y$ and for him, the amount of information remaining about $y$ as conveyed by $x$ is
$I(x : \mathsf{y})$ bits. The acquirer, starting from knowing only $Y$, uses $x$, or equivalently the
corresponding collection of sets $\{Y_z : z \in Z_x\}$, as an intermediate 'medium' to acquire
$I(x : Y)$ bits of information about the unknown value of $y$. Note that, in general, the
provider's information may be smaller than, equal to or larger than the acquired information. For
instance, fix an $x$, then directly from Definition 2 for any $z \in Z_x$ we can compare $I(z : \mathsf{y})$
versus $I(z : Y)$ and see that if $|\mathsf{y} \cup Y_z|$ is closer to $|Y|$ (or $|\mathsf{y}|$) than to $|\mathsf{y}|$ (or $|Y|$) then
$I(z : \mathsf{y}) > I(z : Y)$ (or $I(z : Y) > I(z : \mathsf{y})$) respectively. Thus taking the average over all
$z \in Z_x$ it is possible in general to have $I(x : \mathsf{y})$ smaller than, larger than or equal to $I(x : Y)$.

In this paper we will primarily use the acquirer's perspective and will thus refer to the
sum in (6) as the conditional combinatorial entropy which is defined next.

Definition 3 Let

$H(Y|x) = \frac{1}{|Z_x|} \sum_{z \in Z_x} \log|Y_z|$ (7)

be the conditional entropy of $Y$ given $x$.

It will be convenient to express the conditional entropy as

$H(Y|x) = \sum_{k \geq 2} \omega_x(k) \log k$

with

$\omega_x(k) = \frac{|\{z \in Z_x : |Y_z| = k\}|}{|Z_x|}$. (8)

We will refer to this quantity $\omega_x(k)$ as the conditional density function of $k$. The factor of
$\log k$ comes from $\log|Y_z|$ which from (3) is the combinatorial conditional-entropy $H(Y|z)$.
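The equivalence between (7) and the density form built from (8) can be checked on a small collection. This Python sketch assumes every set in the collection is non-empty; the collection and function names are hypothetical.

```python
from collections import Counter
from math import log2

def conditional_density(collection):
    """omega_x(k) from (8): fraction of sets Y_z in the collection with |Y_z| = k."""
    counts = Counter(len(Y_z) for Y_z in collection)
    return {k: c / len(collection) for k, c in counts.items()}

def entropy_direct(collection):
    """H(Y|x) from (7): the average of log|Y_z| over the collection."""
    return sum(log2(len(Y_z)) for Y_z in collection) / len(collection)

def entropy_from_density(collection):
    """H(Y|x) as the sum over k >= 2 of omega_x(k) * log k."""
    density = conditional_density(collection)
    return sum(w * log2(k) for k, w in density.items() if k >= 2)

# Hypothetical collection of non-empty sets Y_z, z in Z_x.
collection = [{0}, {1, 2}, {3, 4}, {0, 1, 2, 3}]
print(conditional_density(collection))
print(entropy_direct(collection), entropy_from_density(collection))
```

Sets of size one contribute nothing to either sum, which is why the density form can start at k = 2.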

4 Description complexity 

We have so far defined the notion of information $I(x : \mathsf{y})$ about the unknown value $y$
conveyed by $x$. Let us now define the description complexity of $x$.

Definition 4 The description complexity of $x$, denoted $\ell(x)$, is defined as

$\ell(x) = \log \frac{|Z|}{|Z_x|}$. (9)

Remark 3 The description complexity $\ell(x)$ is a non-negative real number measured in bits. It
is less than one bit if the cardinality of $Z_x$ is greater than half that of $Z$.

Definition 4 is motivated by the following: from Section 3 the input $x$ conveys a
certain property common to every set $Y_z$, $z \in Z_x \subseteq Z$, such that the unknown value $y$ is an
element of at least one such set $Y_z$. Without the knowledge of $x$ these indices $z$ are only
known to be elements of $Z$ in which case it takes $\log|Z|$ bits to describe any $z$, or equivalently,
any $Y_z$. If $x$ is given then the length of the binary string that describes a $z$ in $Z_x$ is only
$\log|Z_x|$. The set $Z_x$ can therefore be described by a string of length $\log|Z| - \log|Z_x|$ which
is precisely the right side of (9). Alternatively, $\ell(x)$ is the information $I(x : Z)$ gained about
the unknown value $z$ given $x$ (since $x$ points to a single set $Z_x$ then this information follows
directly from Kolmogorov's formula (4)).

As $|Z_x|$ decreases there are fewer possible sets $Y_z$ that satisfy the property described by
$x$ and the description complexity $\ell(x)$ increases. In this case, $x$ conveys a more 'special'
property of the possible sets $Y_z$ and the 'price' of describing such a property increases. The
following is a useful result.

Lemma 1 Denote by $Z_x^c = Z \setminus Z_x$ the complement of the set $Z_x$ and let $x^c$ denote the input
corresponding to $Z_x^c$. Then

$\ell(x^c) = -\log(1 - 2^{-\ell(x)})$.

Proof: Denote by $p = |Z_x|/|Z|$. Then by definition of the description complexity we have
$\ell(x^c) = -\log(1 - p)$. Clearly,

$2^{-\log\frac{1}{1-p}} = 1 - 2^{-\log\frac{1}{p}}$

from which the result follows. □

Remark 4 Since clearly the proportion of elements $z \in Z$ which are in $Z_x$ plus the
proportion of those in $Z_x^c$ is fixed and equals 1 then $2^{-\ell(x)} + 2^{-\ell(x^c)} = 1$. If the description
complexities $\ell(x)$ and $\ell(x^c)$ change (for instance with respect to an increase in $|Z|$) then they
change in opposite directions. However, the cardinalities of the corresponding sets $Z_x$ and
$Z_x^c$ may both increase, for instance if $|Z|$ grows at a rate faster than the rate of change
of either $\ell(x)$ or $\ell(x^c)$.
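Lemma 1 and Remark 4 are easy to verify numerically. The Python sketch below does so for every possible cardinality of $Z_x$ in a domain of size $|Z| = 16$; this size and the function name are arbitrary illustrative choices.

```python
from math import isclose, log2

def description_complexity(size_Zx, size_Z):
    """l(x) = log(|Z| / |Z_x|), in bits, as in (9)."""
    return log2(size_Z / size_Zx)

SIZE_Z = 16  # hypothetical, e.g. Z = P(Y) with |Y| = 4

for size_Zx in range(1, SIZE_Z):
    l_x = description_complexity(size_Zx, SIZE_Z)
    l_xc = description_complexity(SIZE_Z - size_Zx, SIZE_Z)  # complement input x^c
    # Lemma 1: l(x^c) = -log(1 - 2^{-l(x)})
    assert isclose(l_xc, -log2(1 - 2 ** (-l_x)))
    # Remark 4: the two proportions sum to one
    assert isclose(2 ** (-l_x) + 2 ** (-l_xc), 1.0)

# A collection containing more than half of Z costs less than one bit to describe.
print(description_complexity(12, SIZE_Z))
# A singleton collection costs log|Z| = 4 bits.
print(description_complexity(1, SIZE_Z))
```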

A question to raise at this point is whether the following trivial relationship between
$\ell(x)$ and the entropy $H(Y|x)$ holds,

$\ell(x) + H(Y|x) = H(Y)$. (10)

This is equivalent to asking if

$\ell(x) = I(x : Y)$ (11)

or in words, does the price of describing an input $x$ equal the information gained by knowing
it?

As we show next, the answer depends on certain characteristics of the set $Z_x$. When (4)
does not apply but (6) does, then in general, the relation does not hold.

5 Scenario examples 

In all the following scenarios we take the acquirer's perspective, i.e., with no input given
the unknown $y$ is only known to be in $Y$. As the first scenario, we start with the simplest
uniform setting which is defined as follows:

Scenario S1: As in (1)-(4), an input $x$ amounts to a single set $Y_x$. The set $Z_x$ is a singleton
$\{Y_x\}$ so $|Z_x| = 1$ and instead of $Z$ we have $\Pi_X(A)$. We impose the following conditions: for
all distinct $x, x' \in X$, $Y_x \cap Y_{x'} = \emptyset$ and $|Y_x| = |Y_{x'}|$. With $Y = \bigcup_{x \in X} Y_x$ it follows that for
any $x$, $|Y_x| = |Y|/|X|$. From (4) it follows that the description complexity of any $x$ is

$\ell(x) = \log \frac{|X|}{1} = \log|X|$

and the entropy

$H(Y|x) = \log|Y_x| = \log \frac{|Y|}{|X|}$.

We therefore have

$\ell(x) + H(Y|x) = \log|X| + \log|Y| - \log|X| = \log|Y|$.

Since the right side equals $H(Y)$ then (10) holds. Next consider another scenario:

Scenario S2: An input $x$ gives a single set $Y_x$ but now for any two distinct $x, x'$, we only
force the condition that $|Y_x| = |Y_{x'}|$, i.e., the intersection $Y_x \cap Y_{x'}$ may be non-empty. The
description complexity $\ell(x)$ is the same as in the previous scenario and for any $x, x' \in X$
the entropy is the same, $H(Y|x) = H(Y|x')$, with a value of $\log\left(\alpha \frac{|Y|}{|X|}\right)$ for some $\alpha \geq 1$. So

$\ell(x) + H(Y|x) = \log|X| + \log\left(\alpha \frac{|Y|}{|X|}\right) = \log(\alpha|Y|) \geq \log|Y|$.

Hence the left side of (10) is greater than or equal to the right side. By (11), this means
that the 'price', i.e., the description complexity per bit of information, may be larger than
1.

Let us introduce at this point the following combinatorial quantity:

Definition 5 The cost of information $I(x : \mathsf{y})$, denoted $\kappa_{\mathsf{y}}(x)$, is defined as

$\kappa_{\mathsf{y}}(x) = \frac{\ell(x)}{I(x : \mathsf{y})}$

and represents the number of description bits of $x$ per bit of information about the unknown
value of $y$ as conveyed by $x$.
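To make the cost concrete, the following Python sketch evaluates it for scenarios S1 and S2 with $\mathsf{y} = Y$. The sizes $|Y| = 16$, $|X| = 4$ and the overlap factor 2 are hypothetical values chosen only so that all quantities are integers.

```python
from math import log2

def cost(l_x, info):
    """kappa(x) = l(x) / I(x : Y): description bits per bit of information."""
    return l_x / info

SIZE_Y, SIZE_X = 16, 4           # hypothetical domain sizes
l_x = log2(SIZE_X)               # description complexity, the same in S1 and S2

# Scenario S1: the sets Y_x partition Y, so |Y_x| = |Y| / |X|.
info_s1 = log2(SIZE_Y) - log2(SIZE_Y // SIZE_X)

# Scenario S2: equal-size sets that may overlap, |Y_x| = alpha * |Y| / |X|.
ALPHA = 2                        # hypothetical overlap factor, alpha >= 1
info_s2 = log2(SIZE_Y) - log2(ALPHA * SIZE_Y // SIZE_X)

print(cost(l_x, info_s1))  # 1.0: each description bit buys one bit of information
print(cost(l_x, info_s2))  # 2.0: overlap makes information more expensive
```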

Thus, letting $\mathsf{y} = Y$ and considering the two previous scenarios where Kolmogorov's
definition (4) applies, the cost of information equals 1 or is at least 1, respectively. As the
next scenario, let us consider the following:

Scenario S3: We follow the setting of Definition 2 where an input $x$ means that the
unknown value of $y$ is contained in at least one set $Y_z$, $z \in Z_x$, hence $|Z_x| \geq 1$. Suppose that
$|Y|/|X| = a$ for some integer $a \geq 1$ and assume that for all $x \in X$, $H(Y|x) = \log(a)$. (The sets
$Y_z$, $z \in Z_x$ may still differ in size and overlap). Thus we have

$\ell(x) + H(Y|x) = \log \frac{|Z|}{|Z_x|} + \log \frac{|Y|}{|X|}$. (12)

Suppose that $|Z|/|X| = b$ for some integer $b \geq 1$, $Z_x \cap Z_{x'} = \emptyset$ and $|Z_x| = |Z_{x'}|$ for any
distinct $x, x' \in X$. Since $Z = \bigcup_{x \in X} Z_x$ then $|Z_x| = b$ for all $x \in X$. The right side of (12)
equals $\log|Y|$ and (10) is satisfied. If for some $x, x'$ we have $|Z_x| < b$ and $|Z_{x'}| > b$ (with
entropies both still at $\log(a)$) then the left side of (10) is greater than or less than $H(Y)$,
respectively. Hence it is possible in this scenario for the cost $\kappa_Y(x)$ to be greater or less
than 1. To understand why for some inputs $x$ the cost may be strictly smaller than 1 observe
that under the current scenario the actual set $Y_z$ which contains the unknown $y$ remains
unknown even after producing the description $x$. Thus in this case the left side of (10)
represents the total description complexity of the unknown value of $y$ (on average over all
possible sets $Y_z$) given that the only fact known about $Y_z$ is that its index $z$ is an element
of $Z_x$. In contrast, scenario S1 has the total description complexity of the unknown $y$ on the
left side of (10) which also includes the description of the specific $Y_x$ that contains $y$ (hence
it may be longer). Scenario S3 is an example, as mentioned in Section 1, of knowing a property
which still leaves the acquirer with several sets that contain the unknown $y$.

In Section 7 we will consider several specific properties of this kind. Let us now continue
and introduce additional concepts as part of the framework. 

6 Information width and efficiency 

With the definitions of Section 3 in place we now have a quantitative measure of the
information (and cost) conveyed by an input $x$ about an unknown value $y$. This $y$ is contained
in some set that satisfies a certain property and the set itself may remain unknown. In
subsequent sections we consider several examples of inputs $x$ for which these measures are
computed and compared. Amongst the different ways of conveying information about an
unknown value $y$ it is natural to ask at this point if there exists a notion of maximal
information. This is formalized next by the following definition which resembles the n-widths
used in functional approximation theory [21].

Definition 6 Let

$I^*(l) = \max_{x \in X,\ \ell(x) = l}\ \min_{\mathsf{y} \subseteq Y,\ x \vdash \mathsf{y}} I(x : \mathsf{y})$ (13)

be the $l$th information width.

Remark 5 The above definition is stated from the provider's point of view. He is free to
choose a fixed 'medium', i.e., a structure $Z_x$ of sets (but limited in its description complexity
to $l$) in order to provide information at some later time about any set $\mathsf{y} \subseteq Y$ of objects to
the acquirer. For that he considers all possible inputs $x$ of description complexity $l$ and
measures the information each will provide for the hardest target-subset $\mathsf{y}$. We refer to the
above as the provider's information width.

If we set $\mathsf{y} = Y$ then we obtain the acquirer's width of information, denoted $I^*(l)$,
which takes the simpler form

$I^*(l) = \max_{x \in X,\ \ell(x) = l} I(x : Y)$.

The next result computes the value of $I^*(l)$.
Theorem 1 Denote by $\mathbb{N}$ the positive integers. Let $1 \leq l \leq \log|Z|$ and define

$r(l) = \min\left\{a \in \mathbb{N} : \sum_{i=1}^{a} \binom{|Y|}{i} \geq |Z| 2^{-l}\right\}$.

Then we have

$I^*(l) = \log|Y| - \frac{2^l}{|Z|} \left[ \sum_{k=2}^{r(l)-1} \binom{|Y|}{k} \log k + \left( |Z| 2^{-l} - \sum_{i=1}^{r(l)-1} \binom{|Y|}{i} \right) \log r(l) \right]$. (14)

Proof: Consider a particular input $x^* \in X$ with a description complexity $\ell(x^*) = l$ and
with a corresponding $Z_{x^*}$ that contains the indices $z$ of as many distinct non-empty sets
$Y_z$ of the lowest possible cardinality as possible. By (9) it follows that $Z_{x^*}$ satisfies
$|Z_{x^*}| = |Z| 2^{-l}$ and contains all $z$ such that $1 \leq |Y_z| \leq r(l) - 1$ in addition to
$|Z| 2^{-l} - \sum_{i=1}^{r(l)-1} \binom{|Y|}{i}$ elements $z$ for which $|Y_z| = r(l)$. We therefore have $I(x^* : Y)$
equal to the right side of (14). Any other $x$ with $\ell(x) = l$ must have $H(Y|x) \geq H(Y|x^*)$
since it is formed by replacing one of the sets $Y_z$ above with a larger set $Y_{z'}$. Hence for
such $x$, $I(x : Y) \leq I(x^* : Y)$ and therefore $I^*(l) = I(x^* : Y)$. □
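The counting argument in the proof can be checked by brute force on a very small domain. The Python sketch below evaluates the right side of (14) and compares it with an exhaustive maximisation of $I(x : Y)$ over all collections of the required size. It rests on several assumptions that are not stated in the text: $l$ is an integer (so that $|Z| 2^{-l}$ is an integer), only non-empty sets $Y_z$ are considered, as in the proof, and $r(l)$ is taken to be the smallest $a$ whose cumulative binomial count reaches $|Z| 2^{-l}$. All function names are hypothetical.

```python
from itertools import combinations
from math import comb, log2

def r(l, n_Y):
    """Smallest a with sum_{i=1..a} C(|Y|, i) >= |Z| 2^{-l}, where |Z| = 2^{|Y|}."""
    target = 2 ** (n_Y - l)
    total, a = 0, 0
    while total < target:
        a += 1
        total += comb(n_Y, a)
    return a

def width_formula(l, n_Y):
    """Acquirer's width I*(l) evaluated from the right side of (14)."""
    m = 2 ** (n_Y - l)                       # |Z_x*| = |Z| 2^{-l}
    rl = r(l, n_Y)
    below = sum(comb(n_Y, i) for i in range(1, rl))
    h = sum(comb(n_Y, k) * log2(k) for k in range(2, rl)) + (m - below) * log2(rl)
    return log2(n_Y) - h / m

def width_brute_force(l, n_Y):
    """Maximise I(x : Y) over all collections of |Z| 2^{-l} non-empty subsets of Y."""
    m = 2 ** (n_Y - l)
    subsets = [c for k in range(1, n_Y + 1) for c in combinations(range(n_Y), k)]
    return max(
        log2(n_Y) - sum(log2(len(s)) for s in coll) / m
        for coll in combinations(subsets, m)
    )

for l in range(1, 4):
    print(l, width_formula(l, 3), width_brute_force(l, 3))
```

The exhaustive search is only feasible for tiny domains (here $|Y| = 3$, and $|Y| = 4$ in the checks), which is exactly why a closed form such as (14) is useful.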

The notion of width is more general than that defined above. For instance in functional
approximation theory the so-called n-widths are used to measure the approximation error
of some rich general class of functions, e.g., a Sobolev class, by the closest element of a
manifold of simpler function classes. For instance, the Kolmogorov width $K_n(F)$ of a class
$F$ of functions (see [21]) is defined as $K_n(F) = \inf_{F_n \subseteq F} \sup_{f \in F} \inf_{g \in F_n} \|f - g\|$ where $F_n$
varies over all linear subspaces of $F$ of dimensionality $n$. Thus from this more general
set-perspective it is perhaps not surprising that such a basic quantity of width also has an
information theoretic interpretation as we have shown in (13). The work of [16] considers
the VC-width of a finite-dimensional set $F$ defined as

$\rho_n^{VC}(F) = \inf_{H^n \in \mathcal{H}_n} \sup_{f \in F} \mathrm{dist}(f, H^n)$

where $F \subseteq \mathbb{R}^m$ is a target set, $H^n$ runs over the class $\mathcal{H}_n$ of all sets $H^n \subseteq \mathbb{R}^m$ of
VC-dimension $VC(H^n) = n$ (see Definition 8) and $\mathrm{dist}(f, H^n) = \inf_{h \in H^n} \mathrm{dist}(f, h)$ where $\mathrm{dist}()$
denotes the distance between an element $f \in F$ and $h \in H^n$ based on the $l_q^m$-norm, $1 \leq q \leq \infty$.
We can make the following analogy with the information width of (13): $f$ corresponds
to $y$, $F$ to $Y$, $h$ to $z$, $n$ corresponds to $l$, $H^n$ to $x$ (or equivalently to $Z_x$), the condition
$VC(H^n) = n$ corresponds to the condition of having a description complexity $\ell(x) = l$,
the class $\mathcal{H}_n$ corresponds to the set $\{x \in X : \ell(x) = l\}$, $\mathrm{dist}(f, h)$ corresponds to $I(z : \mathsf{y})$,
$\mathrm{dist}(f, H^n) = \inf_{h \in H^n} \mathrm{dist}(f, h)$ corresponds to $I(x : \mathsf{y}) = (1/|Z_x|) \sum_{z \in Z_x} I(z : \mathsf{y})$, $\sup_{f \in F}$
corresponds to $\min_{\mathsf{y} \subseteq Y}$, and $\inf_{H^n : VC(H^n) = n}$ corresponds to $\max_{x : \ell(x) = l}$.

The notion of information efficiency to be introduced below is based on the acquirer's
information width $I^*(l)$.

Definition 7 Denote by

$\kappa^*(x) = \frac{\ell(x)}{I^*(\ell(x))}$

the per-bit cost of maximal information conveyed about an unknown target $y$ in $Y$ considering
all possible inputs of the same description complexity as $x$. Consider an input $x \in X$
informative for $Y$. Then the efficiency of $x$ for $Y$ is defined by

$\eta_Y(x) = \frac{\kappa^*(x)}{\kappa_Y(x)}$

where the cost $\kappa_Y(x)$ is defined in Definition 5.
Remark 6 By definition of $\kappa^*(x)$ and $\kappa_Y(x)$ it follows that

$\eta_Y(x) = \frac{I(x : Y)}{I^*(\ell(x))}$. (15)

While we will not use it here, the provider's efficiency can be defined in a similar way. 
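As an illustration of Definition 7 and Remark 6, the Python sketch below computes the efficiency of one input in two ways: directly from (15) and as the ratio of the two costs. The width is obtained by taking the smallest non-empty subsets first, following the construction in the proof of Theorem 1. The domain size $|Y| = 4$, the input and the function names are hypothetical.

```python
from math import comb, log2

def acquired_info(n_Y, sizes):
    """I(x : Y) from (6), given |Y| and the list of sizes |Y_z|, z in Z_x."""
    return log2(n_Y) - sum(log2(k) for k in sizes) / len(sizes)

def width(n_Y, m):
    """I*(l) for inputs with |Z_x| = m: use the m smallest non-empty subsets of Y."""
    all_sizes = [k for k in range(1, n_Y + 1) for _ in range(comb(n_Y, k))]
    return acquired_info(n_Y, all_sizes[:m])

def efficiency(n_Y, sizes):
    """eta_Y(x) = I(x : Y) / I*(l(x)), as in (15)."""
    return acquired_info(n_Y, sizes) / width(n_Y, len(sizes))

# Hypothetical input x over |Y| = 4: four classes of two elements each,
# so |Z_x| = 4 and l(x) = log(16 / 4) = 2 bits.
N_Y = 4
sizes = [2, 2, 2, 2]
l_x = log2(2 ** N_Y / len(sizes))

cost_x = l_x / acquired_info(N_Y, sizes)       # kappa_Y(x)
best_cost = l_x / width(N_Y, len(sizes))       # kappa*(x)

print(efficiency(N_Y, sizes))   # 0.5
print(best_cost / cost_x)       # 0.5, the same value via the ratio of costs
```

Here the best input of the same description complexity points to four singletons and conveys 2 bits, whereas the chosen input conveys 1 bit, so it is half as efficient.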

Let us consider an example where the above definitions may be applied. Let $n$ be a
positive integer and denote by $[n] = \{1, \ldots, n\}$. Let the target space be $Y = \{0,1\}^{[n]}$ which
consists of all binary functions $g : [n] \to \{0,1\}$. Let $Z = \mathcal{P}(Y)$ be the set of indices $z$ of all
possible classes $Y_z \subseteq Y$ of binary functions $g$ on $[n]$ (as before for any set $E$ we denote by
$\mathcal{P}(E)$ its power set). Let $X = \mathcal{P}(Z)$ consist of all possible (property) sets $Z_x \subseteq Z$. Thus
here every possible class of binary functions on $[n]$ and every possible property of a class is
represented. Figure 1(a) shows $I^*(l)$ and Figure 1(b) displays the cost $\kappa^*(l)$ for this example
for $n = 5, 6, 7$. From these graphs we see that the width $I^*(l)$ grows at a sub-linear rate with
respect to $l$ since the cost strictly increases.

In the next section, we apply the theory introduced in the previous sections to the space 
of binary functions. 

7 Binary function classes 

Let $F = \{0,1\}^{[n]}$ and write $\mathcal{P}(F)$ for the power set which consists of all subsets $G \subseteq F$.
Let $G \models \mathcal{M}$ represent the statement "$G$ satisfies property $\mathcal{M}$". In order to apply the above
framework we let $y$ represent an unknown target $t \in F$ and $x$ a description object, e.g., a
binary string, that describes the possible properties $\mathcal{M}$ of sets $G \subseteq F$ which may contain $t$.
Denote by $x_{\mathcal{M}}$ the object that describes property $\mathcal{M}$. Our aim is to compute the value of
information $I(x_{\mathcal{M}} : F)$, the description complexity $\ell(x_{\mathcal{M}})$, the cost $\kappa_F(x_{\mathcal{M}})$ and efficiency
$\eta_F(x_{\mathcal{M}})$ for various inputs $x_{\mathcal{M}}$.

Note that the set $Z_x$ used in the previous sections is now a collection of classes $G$, i.e.,
elements of $\mathcal{P}(F)$, which satisfy a property $\mathcal{M}$. We will sometimes refer to this collection
by $\mathcal{M}$ and write $|\mathcal{M}|$ for its cardinality (which is analogous to $|Z_x|$ in the notation of the
preceding sections).

Before we proceed, let us recall a few basic definitions from set theory. For any fixed
subset $E \subseteq [n]$ of cardinality $d$ and any $f \in F$ denote by $f|_E \in \{0,1\}^d$ the restriction of $f$
on $E$. For a set $G \subseteq F$ of functions, the set

$$\mathrm{tr}_G(E) = \{f|_E : f \in G\}$$

is called the trace of $G$ on $E$. The trace is a basic and useful measure of the combinatorial
richness of a binary function class and is related to its density (see Chapter 17 in [5]). It has
also been shown to relate to various fundamental results in different fields, e.g., statistical
learning theory [26], combinatorial geometry [20], graph theory [3, 10] and the theory of
empirical processes [22]. It is a member of a more general class of properties that are
expressed in terms of certain allowed or forbidden restrictions pQ . In this paper we focus on 
properties based on the trace of a class which are expressed in terms of a positive integer
parameter $d$ in the following general form:

$$d = \max\{|E| : E \subseteq [n],\ \text{condition on } \mathrm{tr}_G(E) \text{ holds}\}.$$

The first definition taking such form is the so-called Vapnik-Chervonenkis dimension.

Definition 8 The Vapnik-Chervonenkis dimension of a set $G \subseteq F$, denoted $VC(G)$, is
defined as

$$VC(G) = \max\{|E| : E \subseteq [n],\ |\mathrm{tr}_G(E)| = 2^{|E|}\}.$$

The next definition considers the other extreme for the size of the trace.

Definition 9 Let $L(G)$ be defined as

$$L(G) = \max\{|E| : E \subseteq [n],\ |\mathrm{tr}_G(E)| = 1\}.$$

For any $G \subseteq F$ define the following three properties:

$\mathcal{L}_d$ = '$L(G) \ge d$'
$\mathcal{V}_d$ = '$VC(G) < d$'
$\mathcal{V}_d^c$ = '$VC(G) \ge d$'.
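The trace-based quantities of Definitions 8 and 9 can be checked by brute force on small classes. The following is a minimal sketch, not part of the original analysis: it assumes each binary function on $[n]$ is stored as a tuple of 0/1 values indexed from 0 to $n-1$, and the helper names `trace`, `vc_dimension` and `l_dimension` are hypothetical.

```python
from itertools import combinations


def trace(G, E):
    """Trace of the class G on the index set E: the set of restrictions f|_E."""
    return {tuple(f[i] for i in E) for f in G}


def vc_dimension(G, n):
    """Largest |E| for which the trace of G on E contains all 2^|E| patterns."""
    best = 0
    for size in range(1, n + 1):
        if any(len(trace(G, E)) == 2 ** size
               for E in combinations(range(n), size)):
            best = size
    return best


def l_dimension(G, n):
    """Largest |E| on which all functions of G agree (trace of size 1)."""
    best = 0
    for size in range(1, n + 1):
        if any(len(trace(G, E)) == 1
               for E in combinations(range(n), size)):
            best = size
    return best


# Assumed toy class on n = 3 points: first coordinate fixed to 0, others free.
G = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
print(vc_dimension(G, 3), l_dimension(G, 3))
```

For this toy class the last two coordinates are shattered and only the first coordinate is constant, so the class satisfies the property 'VC(G) at least 2' and the property 'L(G) at least 1'. The enumeration is exponential in $n$ and is meant only for illustration.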

We now apply the framework to these and other related properties (for clarity, we defer
some of the proofs to Section 10.2). Henceforth, for two sequences $a_n$, $b_n$, we write $a_n \approx b_n$
to denote that $\lim_{n \to \infty} a_n/b_n = 1$, and $a_n \ll b_n$ denotes $\lim_{n \to \infty} a_n/b_n = 0$. Denote the standard
normal probability density and cumulative distribution by $\phi(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)$
and $\Phi(x) = \int_{-\infty}^{x} \phi(z)\,dz$, respectively. The main results are stated as Theorems 2 through 5.

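Theorem 2 Let $t$ be an unknown element of $F$. Then the value of information in knowing
that $t \in G$ where $G \models \mathcal{L}_d$, is

$$I(x_{\mathcal{L}_d} : F) = \log|F| - \sum_{k \ge 2} \omega_{x_{\mathcal{L}_d}}(k) \log k \approx n - \frac{\Phi(-a) \log\left(\frac{2^n}{1+2^d}\right) + 2^{-(n-d)/2} \phi(a) + O(2^{-(n-d)})}{1 - \left(\frac{2^d}{1+2^d}\right)^{2^n}}$$

where

$$a = 2(1 + 2^d) 2^{-(n+d)/2} - 2^{(n-d)/2}$$

and the description complexity of $x_{\mathcal{L}_d}$ is

$$\ell(x_{\mathcal{L}_d}) \approx 2^n \left(\frac{2^d}{1+2^d}\right) - d - c \log n$$

for some $1 \le c \le d$, as $n$ increases.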

Remark 7 For large $n$, we have the following estimates:

$$I(x_{\mathcal{L}_d} : F) \simeq n - \log\left(\frac{2^n}{1+2^d}\right) \simeq d$$

and

$$\ell(x_{\mathcal{L}_d}) \simeq 2^n - d.$$

The cost is estimated by

$$\kappa_F(x_{\mathcal{L}_d}) \simeq \frac{2^n - d}{d}.$$

The next result is for property $\mathcal{V}_d^c$.

Theorem 3 Let $t$ be an unknown element of $F$. Denote by $a = (2^n - 2^{d+1}) 2^{-n/2}$. Then the
value of information in knowing that $t \in G$, $G \models \mathcal{V}_d^c$, is

$$I(x_{\mathcal{V}_d^c} : F) = \log|F| - \sum_{k \ge 2} \omega_{x_{\mathcal{V}_d^c}}(k) \log k$$



(n - 1) ( 2 n <P (a) + 2™/ 2 (a) ( 1 + 



(n-l)2" 

Ti — ■ ; — ; — 

2 n <P(a) + 2 n / 2 <f>(a) 

with increasing $n$. Assume that $d = d_n > \log n$; then the description complexity of $x_{\mathcal{V}_d^c}$
satisfies

$$\ell(x_{\mathcal{V}_d^c}) \approx d(2^d + 1) + \log(d) - \log\left(2^n \Phi(a) + 2^{n/2} \phi(a)\right) - \log n + 1.$$
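Remark 8 For large $n$, the information value is approximately

$$I(x_{\mathcal{V}_d^c} : F) \simeq 1$$

and

$$\ell(x_{\mathcal{V}_d^c}) \simeq d 2^d - n - \log\left(\frac{n}{d}\right)$$

thus

$$\kappa_F(x_{\mathcal{V}_d^c}) \simeq \ell(x_{\mathcal{V}_d^c}).$$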

We note that the description length increases with respect to d implying that the proportion 
of classes with a VC-dimension larger than d decreases with d. With respect to n it behaves 
oppositely. 

The property of having an (upper) bounded VC-dimension (or trace) has been widely
studied in numerous fields (see the earlier discussion). For instance, in statistical learning
theory [6, 26] the important property of convergence of the empirical averages to the means
occurs uniformly over all elements of an infinite class provided that it satisfies this property.
It is thus interesting to study the property $\mathcal{V}_d$ defined above even for a finite class of binary
functions.

Theorem 4 Let $t$ be an unknown element of $F$. The value of information in knowing that
$t \in G$, $G \models \mathcal{V}_d$ is

$$I(x_{\mathcal{V}_d} : F) \approx 1 - o(2^{-n/2})$$

with $n$ and $d = d_n$ increasing such that $n < d_n 2^{d_n}$. The description complexity of $x_{\mathcal{V}_d}$ is

$$\ell(x_{\mathcal{V}_d}) = -\log\left(1 - 2^{-\ell(x_{\mathcal{V}_d^c})}\right)$$

where $\ell(x_{\mathcal{V}_d^c})$ is as in Theorem 3.

Remark 9 Both the description complexity and the cost of information are approximated
as

$$\kappa_F(x_{\mathcal{V}_d}) \simeq \ell(x_{\mathcal{V}_d}) \simeq -\log\left(1 - 2^{-(d 2^d - n - \log(n/d))}\right).$$

Relating to Remark 8, while $\ell(x_{\mathcal{V}_d})$ increases with respect to $n$ and hence the proportion
of classes with the property $\mathcal{V}_d$ decreases as $2^{-\ell(x_{\mathcal{V}_d})}$, the actual number of binary function
classes that have this property (i.e., the cardinality of the corresponding set $Z_x$) increases
with $n$ since

$$|Z_{x_{\mathcal{V}_d}}| = |Z|\, 2^{-\ell(x_{\mathcal{V}_d})} = 2^{2^n}\left(1 - 2^{-\ell(x_{\mathcal{V}_d^c})}\right) \simeq 2^{2^n}\left(1 - 2^{-(d 2^d - n - \log(n/d))}\right).$$

The number of classes that have the complement property $\mathcal{V}_d^c$ also clearly increases since
$\ell(x_{\mathcal{V}_d^c})$ decreases with $n$. We note that the description length decreases with respect to $d$,
implying that the proportion of classes with a VC-dimension no larger than $d$ increases with
$d$.

As another related case, consider an input $x$ which, in addition to conveying that $t \in G$
with $VC(G) < d$, also provides a labeled sample $S_m = \{(\xi_i, \zeta_i)\}_{i=1}^{m}$, $\xi_i \in [n]$, $\zeta_i = t(\xi_i)$,
$1 \le i \le m$. This means that for all $f \in G$, $f(\xi_i) = \zeta_i$, $1 \le i \le m$. We express this by
stating that $G$ satisfies the property

$\mathcal{V}_d(S_m)$ = '$VC(G) < d$, $G|_{\xi} = \zeta$'

where $G|_{\xi}$ denotes the set of restrictions $\{f|_{\xi} : f \in G\}$, $f|_{\xi} = [f(\xi_1), \ldots, f(\xi_m)]$ and $\zeta =
[\zeta_1, \ldots, \zeta_m]$. The following result states the value of information and cost for property
$\mathcal{V}_d(S_m)$.

Theorem 5 Let $t$ be an unknown element of $F$ and $S_m = \{(\xi_i, \zeta_i)\}_{i=1}^{m}$ a sample. Then
the value of information in knowing that $t \in G$ where $G \models \mathcal{V}_d(S_m)$ is

I(x Vd(Sm y.F) « m- (2-("~ m )/ 2 ) 

with $n$ and $d = d_n$ increasing such that $n < d_n 2^{d_n}$. The description complexity of $x_{\mathcal{V}_d(S_m)}$
is

£(x Vd{Sm) ) « 2»(1 + log(l - p)) + (<?(a)2- m + <A(a)2(- m )/ 2 ) + (1 - p) 2 " 

where $p = 2^{-m}/(2^{-m} + 1)$, $a = (2^n p - 2^d)/\sigma$, $\sigma = \sqrt{2^n p(1-p)}$.
Remark 10 The description complexity is estimated by 

1 + d2rf (i + ^) +m + lQ g( 1 -^ 

and the cost of information is

$$\kappa_F(x_{\mathcal{V}_d(S_m)}) \simeq \frac{\ell(x_{\mathcal{V}_d(S_m)})}{m}.$$

Remark 11 The dependence of the description complexity on $d$ disappears rapidly with
increasing $d$, and the effect of $m$ remains minor, which effectively makes $\ell(x_{\mathcal{V}_d(S_m)})$ almost take
the maximal possible value of $2^n$. Thus the proportion of classes which satisfy property
$\mathcal{V}_d(S_m)$ is very small.

7.1 Balanced properties

Theorems 3 and 4 pertain to property $\mathcal{V}_d^c$ and its complement $\mathcal{V}_d$. It is interesting that in
both cases the information value is approximately equal to 1. If we denote by $P^*_{n,k}$ a uniform
probability distribution over the space of classes $G \subseteq F$ conditioned on $|G| = k$ (this will
be defined later in a more precise context in (20)) then, as is shown later, $P^*_{n,k}(\mathcal{V}_d)$ and
$P^*_{n,k}(\mathcal{V}_d^c)$ vary approximately linearly with respect to $k$. Thus in both cases the conditional
density $\omega_x(k)$ is dominated by the value of $k = 2^{n-1}$ and hence both have approximately the
same conditional entropies and information values. Let us define the following:

Definition 10 A property $\mathcal{M}$ is called balanced if

$$I(x_{\mathcal{M}} : F) = I(x_{\mathcal{M}^c} : F).$$

We may characterize some sufficient conditions for $\mathcal{M}$ to be balanced. First, as in the case of
property $\mathcal{V}_d$ and more generally for any property $\mathcal{M}$, a sufficient condition for this to hold is
to have a density (and that of its complement $\mathcal{M}^c$) dominated by some cardinality value $k^*$.
Representing $\omega_{x_{\mathcal{M}}}(k)$ by a posterior probability function $P_n(k|\mathcal{M})$, for instance as in (30)
for $\mathcal{M} = \mathcal{L}_d$, makes the conditional entropies $H(F|x_{\mathcal{M}})$ and $H(F|x_{\mathcal{M}^c})$ be approximately
the same. A stricter sufficient condition is to have

$$\omega_{x_{\mathcal{M}}}(k) = \omega_{x_{\mathcal{M}^c}}(k)$$

for every $k$. This implies the condition that

$$P_n(k|\mathcal{M}) = P_n(k|\mathcal{M}^c)$$

which using Bayes rule gives

$$\frac{P(\mathcal{M}^c|k)}{P(\mathcal{M}|k)} = \frac{P(\mathcal{M}^c)}{P(\mathcal{M})}$$

for all $2 \le k \le 2^n$.

In words, this condition says that the bias of favoring a class $G$ as satisfying property $\mathcal{M}$
versus $\mathcal{M}^c$ (i.e., the ratio of their probabilities) should be constant with respect to the
cardinality $k$ of $G$. Any such property is therefore characterized by certain features of a
class $G$ that are invariant to its size, i.e., if the size of $G$ is provided in advance then no
information is gained about whether $G$ satisfies $\mathcal{M}$ or its complement $\mathcal{M}^c$.

In contrast, property $\mathcal{L}_d$ is an example of a very unbalanced property. It is an example
of a general property whose posterior function decreases fast with respect to $k$, as we now
consider:

Example: Let $Q$ be a property with a distribution $P^*_{n,k}(Q) = c\alpha^k$, $0 < \alpha < 1$, $c > 0$. In a
similar way as Theorem 2 is proved we obtain that the information value of this property
tends to

$$I(x_Q : F) \approx n - \frac{\Phi(-a) \log\left(\frac{2^n \alpha}{1+\alpha}\right) + \phi(a)/\sqrt{\alpha 2^n} + O(1/(\alpha 2^n))}{1 - (1+\alpha)^{-2^n}}$$

with increasing $n$ where $a = (2 - 2^n p)/\sqrt{2^n p(1-p)}$ and $p = \alpha/(1+\alpha)$. This is approximated
as

$$I(x_Q : F) \simeq n - \left(n + \log\left(\frac{\alpha}{1+\alpha}\right)\right) = \log\left(1 + \frac{1}{\alpha}\right).$$

For instance, suppose $P^*_{n,k}(Q)$ is an exponential probability function; then taking $\alpha = 1/e$
gives an information value of

$$I(x_Q : F) \simeq \log(1 + e) \simeq 1.89.$$

For the complement $Q^c$, if we approximate $P^*_{n,k}(Q^c) = 1 - c\alpha^k \approx 1$ and the conditional
entropy (7) as

$$\sum_{k \ge 2} P_n(k) \log k \approx \log\left(2^n \cdot \tfrac{1}{2}\right) = n - 1,$$

where $P_n(k)$ is the binomial probability distribution with parameters $2^n$ and $1/2$, then the
information value is approximated by

$$I(x_{Q^c} : F) \simeq n - (n - 1) = 1.$$

By taking $\alpha$ to be even smaller we obtain a property $Q$ which has a very different information
value compared to $Q^c$.
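The approximation in the example above can be checked numerically for moderate $n$. The sketch below is an illustration under stated assumptions, not the paper's computation: it takes the density over cardinalities to be proportional to $\alpha^k \binom{2^n}{k}$ for $k \ge 1$ (that is, the geometric posterior of the example weighted by the binomial cardinality distribution with parameter 1/2), uses base-2 logarithms, and the function name `information_value` is hypothetical. The parameter written `alpha` in the code is the geometric decay parameter of the example.

```python
import math


def information_value(n, alpha):
    """n minus sum_{k>=2} omega(k) log2 k, with omega(k) proportional to
    alpha^k * C(2^n, k) for k = 1..2^n (assumed form of the density)."""
    N = 2 ** n
    # Work with log-weights to avoid overflow in the binomial coefficients.
    logw = [
        k * math.log(alpha)
        + math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
        for k in range(1, N + 1)
    ]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    entropy = sum(wk * math.log2(k)
                  for k, wk in enumerate(w, start=1) if k >= 2) / total
    return n - entropy


n = 12
print(information_value(n, 1 / math.e), math.log2(1 + math.e))
# alpha = 1 makes the density the plain binomial one, as assumed for Q^c.
print(information_value(n, 1.0))
```

With $n = 12$ the first value should be close to $\log(1 + e)$ and the second close to 1, which is the contrast between $Q$ and $Q^c$ described in the example.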

8 Comparison 

We now compare the information values and the efficiencies for the various inputs $x$ considered
in the previous section. In this comparison we also include the following simple
property: let $G \in \mathcal{P}(\{0,1\}^{[n]})$ be any class of functions and denote by the
identity property $\mathcal{M}(G)$ of $G$ the property which is satisfied only by $G$. We immediately
have

$$I(x_{\mathcal{M}(G)} : F) = n - \log|G| \quad (16)$$

and

$$\ell(x_{\mathcal{M}(G)}) = 2^n - \log(1) = 2^n$$

since the cardinality $|\mathcal{M}(G)| = 1$. The cost in this case is

$$\kappa_F(x_{\mathcal{M}(G)}) = \frac{2^n}{n - \log|G|}.$$

Note that $x$ conveys that $t$ is in a specific class $G$, hence the entropy and information values
are according to Kolmogorov's definitions. The efficiency in this case is simple to
compute using (15): we have $I^*(\ell(x)) = I^*(2^n)$ and the sums in (14) vanish since $r(2^n) = 1$,
thus $I^*(2^n) = n$ and $\eta_F(x) = (n - \log|G|)/n$.

Let us first compare the information value and the efficiency of three subcases of this
identity property with the following three different class cardinalities: $|G| = \sqrt{n}$, $|G| = n$
and $|G| = 2^{n - \sqrt{n}}$. Figure 2 displays the information value and Figure 3 shows the efficiency
for these subcases. As seen, the information value increases as the cardinality of $G$ decreases,
which follows from (16). The efficiency $\eta_F(x)$ for these three subcases may be obtained
exactly and equals (in the same order as above) $1 - (\log n)/(2n)$, $1 - (\log n)/n$
and $1/\sqrt{n}$. Thus a property with a single element $G$ may have an efficiency which increases
or decreases depending on the rate of growth of the cardinality of $G$ with respect to $n$.

Let us compare the efficiency for inputs $x_{\mathcal{L}_d}$, $x_{\mathcal{V}_d^c}$, $x_{\mathcal{V}_d}$ and $x_{\mathcal{V}_d(S_m)}$. As an example,
suppose that the VC-dimension parameter $d$ grows as $d(n) = \sqrt{n}$. As can be seen from
Figure 4, property $\mathcal{V}_d$ is the most efficient of the three, staying above the 80% level. Letting
the sample size increase at the rate of $m(n) = n^a$, then from Figure 5 the efficiency of $\mathcal{V}_d(S_m)$
increases with respect to $a$ but remains smaller than the efficiency of property $\mathcal{V}_d$. Letting
the VC-dimension increase as $d(n) = n^b$, Figure 6 displays the efficiency of $\mathcal{V}_d(S_m)$ as
a function of $b$ for several values of $a = 0.1, 0.2, \ldots, 0.4$ where $n$ is fixed at 10. As seen, the
efficiency increases approximately linearly with $a$ and non-linearly with respect to $b$ with a
saturation at approximately $b = 0.2$.
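The closed-form quantities quoted for the identity property are easy to evaluate directly. The following sketch simply encodes (16), the description complexity $2^n$, the cost and the efficiency $(n - \log|G|)/n$; base-2 logarithms are assumed, and the function name `identity_property` and the choice $n = 16$ are illustrative only.

```python
import math


def identity_property(n, log2_size):
    """Return (information, description complexity, cost, efficiency) of the
    identity property of a class G with log2|G| = log2_size."""
    info = n - log2_size          # equation (16)
    length = 2 ** n               # 2^n - log(1)
    cost = length / info          # description bits per bit of information
    efficiency = info / n         # uses the stated value I*(2^n) = n
    return info, length, cost, efficiency


n = 16
subcases = {
    "|G| = sqrt(n)": math.log2(math.sqrt(n)),
    "|G| = n": math.log2(n),
    "|G| = 2^(n - sqrt(n))": n - math.sqrt(n),
}
for name, log2_size in subcases.items():
    print(name, identity_property(n, log2_size))
```

For $n = 16$ the three efficiencies come out as $1 - (\log n)/(2n)$, $1 - (\log n)/n$ and $1/\sqrt{n}$ respectively, in agreement with the expressions given in the text.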

9 Conclusions 

The information width introduced here is a fundamental concept based on which a combinatorial
interpretation of information is defined and used as the basis for the concept of
efficiency of information. We defined the width from two perspectives, that of the provider 
and the acquirer of information and used it as a reference point according to which the 
efficiency of any input information can be evaluated. As an application we considered the 
space of binary function classes on a finite domain and computed the efficiency of information
conveyed by various class properties. The main point that arises from these results is
that side-information of different types can be quantified, computed and compared in this 
common framework which is more general than the standard framework used in the theory 
of information transmission. 

As further work, it will be interesting to compute the efficiency of information in 
other applications, for instance, pertaining to properties of classes of Boolean functions
$f : \{0,1\}^n \to \{0,1\}$ (for which there are many applications, see for instance [8]). It will
be interesting to examine standard search algorithms, for instance, those used in machine
learning over a finite search space (or hypothesis space) and compute their information
efficiency, i.e., accounting for all side information available for an algorithm (including data)
and computing for it the acquired information value and efficiency. 

In our treatment of this subject we did not touch the issue of how the information is used. 
For instance, a learning algorithm uses side-information and training data to learn a pattern 
classifier which has minimal prediction (generalization) error. A search algorithm in the area 
of information-retrieval uses an input query to return an answer set that overlaps as many 
of the relevant objects and at the same time has as few non-relevant objects as possible. 
In each such application the information acquirer, e.g., an algorithm, has an associated 
performance criterion, e.g., prediction error, percentage recall or precision, according to 
which it is evaluated. What is the relationship between information and performance? Does
performance depend on efficiency or only on the amount of provided information? What
are the consequences of using input information of low efficiency? For the current work,
we leave these questions open. The remaining parts of the paper consist of the technical
work used to obtain the previous results.

10 Technical results



In this section we provide the proofs of Theorems 2 to 5. Our approach is to estimate
the number of sets $G \subseteq F$ that satisfy a property $\mathcal{M}$. Using the techniques from [28] we
employ a probabilistic method by which a random class is generated and the probability
that it satisfies $\mathcal{M}$ is computed. As we use the uniform probability distribution on elements
of the power set $\mathcal{P}(F)$, probabilities yield cardinalities of the corresponding sets. The
computation of $\omega_x(k)$ and hence of (6) follows directly. It is worth noting that, as in [12], the
notion of probability is only used here for simplifying some of the counting arguments and
thus, unlike Shannon's information, it plays no role in the actual definition of information.

Before proceeding with the proofs, in the next section we describe the probability model 
for generating a random class. 

10.1 Random class generation 

In this subsection we describe the underlying probabilistic processes with which a random
class is generated. We use the so-called binomial model to generate a random class of
binary functions (this has been extensively used in the area of random graphs [11]). In this
model, the random class $\mathcal{F}$ is constructed through $2^n$ independent coin tossings, one for
each function in $F$, with a probability of success (i.e., selecting a function into $\mathcal{F}$) equal to $p$.
The probability distribution $P_{n,p}$ is formally defined on $\mathcal{P}(F)$ as follows: given parameters
$n$ and $0 \le p \le 1$, for any $G \in \mathcal{P}(F)$,

$$P_{n,p}(\mathcal{F} = G) = p^{|G|} (1-p)^{2^n - |G|}.$$

In our application, we choose $p = 1/2$ and denote the probability distribution as $P_n := P_{n,\frac{1}{2}}$.
It is clear that for any element $G \in \mathcal{P}(F)$, the probability that the random class $\mathcal{F}$ equals
$G$ is

$$\alpha_n := P_n(\mathcal{F} = G) = \left(\frac{1}{2}\right)^{2^n} \quad (17)$$

and the probability of $\mathcal{F}$ having a cardinality $k$ is

$$P_n(|\mathcal{F}| = k) = \binom{2^n}{k} \alpha_n, \quad 1 \le k \le 2^n. \quad (18)$$
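The binomial model is straightforward to simulate. The sketch below is illustrative only: it assumes binary functions on $[n]$ are represented as 0/1 tuples, and the names `random_class` and `cardinality_pmf` are hypothetical. The second function evaluates the right side of (18), extended to $k = 0$ so that the probabilities sum to one.

```python
import math
import random
from itertools import product


def random_class(n, p=0.5, rng=random):
    """Draw a random class: each of the 2^n binary functions on [n] is
    included independently with probability p."""
    return {f for f in product((0, 1), repeat=n) if rng.random() < p}


def cardinality_pmf(n, k):
    """Probability that the random class has exactly k elements when p = 1/2,
    i.e. C(2^n, k) * (1/2)^(2^n)."""
    return math.comb(2 ** n, k) * 0.5 ** (2 ** n)


rng = random.Random(0)        # fixed seed only to make the sketch repeatable
G = random_class(3, rng=rng)
print(len(G), [cardinality_pmf(3, k) for k in range(9)])
```

For small $n$ the empirical distribution of `len(random_class(n))` over many draws can be compared with `cardinality_pmf`.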

The following fact easily follows from the definition of the conditional probability: for any
set $B \subseteq \mathcal{P}(F)$,

$$P_n(\mathcal{F} \in B \mid |\mathcal{F}| = k) = \frac{\sum_{G \in B} \alpha_n}{\binom{2^n}{k} \alpha_n} = \frac{|B|}{\binom{2^n}{k}}. \quad (19)$$

Denote by

$$F^{(k)} = \{G \in \mathcal{P}(F) : |G| = k\}$$

the collection of binary-function classes of cardinality $k$, $1 \le k \le 2^n$. Consider the uniform
probability distribution on $F^{(k)}$ which is defined as follows: given parameters $n$ and $1 \le
k \le 2^n$ then for any $G \in \mathcal{P}(F)$,

$$P^*_{n,k}(G) = \frac{1}{\binom{2^n}{k}}, \quad \text{if } G \in F^{(k)}, \quad (20)$$

and $P^*_{n,k}(G) = 0$ otherwise. Hence from (19) and (20) it follows that for any $B \subseteq \mathcal{P}(F)$

$$P_n(\mathcal{F} \in B \mid |\mathcal{F}| = k) = P^*_{n,k}(\mathcal{F} \in B). \quad (21)$$

It will be convenient to use another probability distribution which estimates $P^*_{n,k}$ and is
defined as follows. Construct a random $n \times k$ binary matrix by fair-coin tossings with the
$nk$ elements taking values 0 or 1 independently with probability 1/2. Denoting by $Q^*_{n,k}$ the
probability measure corresponding to this process then for any matrix $U \in \mathcal{U}_{n \times k}(\{0,1\})$,

$$Q^*_{n,k}(U) = \frac{1}{2^{nk}}.$$

Clearly, the columns of a binary matrix $U$ are vectors of length $n$ which are binary functions
on $[n]$. Hence the set $\mathrm{cols}(U)$ of columns of $U$ represents a class of binary functions. It
contains $k$ elements if and only if $\mathrm{cols}(U)$ consists of distinct elements, or less than $k$
elements if two columns are the same. Denote by $S$ a simple binary matrix as one all of
whose columns are distinct ([1]). We claim that the conditional distribution of the set of
columns of a random binary matrix, knowing that the matrix is simple, is the uniform
probability distribution $P^*_{n,k}$. To see this, observe that the probability that the columns of
a random binary matrix are distinct is

$$Q^*_{n,k}(S) = \frac{2^n (2^n - 1) \cdots (2^n - k + 1)}{2^{nk}}. \quad (22)$$

For any fixed class $G \in \mathcal{P}(F)$ of $k$ binary functions there are $k!$ corresponding simple
matrices in $\mathcal{U}_{n \times k}(\{0,1\})$. Therefore given a simple matrix $S$, the probability that $\mathrm{cols}(S)$
equals a class $G$ is

$$Q^*_{n,k}(G \mid S) = \frac{k!}{2^{nk}} \cdot \frac{2^{nk}}{2^n (2^n - 1) \cdots (2^n - k + 1)} = \frac{1}{\binom{2^n}{k}} = P^*_{n,k}(G). \quad (23)$$
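The two counting facts behind (22) and (23) can be verified by exhaustive enumeration for very small $n$ and $k$. The sketch below is an illustration, not part of the original proof: a matrix is represented as a tuple of $k$ columns, each column a 0/1 tuple of length $n$, and the helper name `simple_matrix_stats` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb, factorial


def simple_matrix_stats(n, k):
    """Enumerate all n x k binary matrices; return the exact probability that
    a uniformly drawn matrix is simple, and how many simple matrices map to
    each column set."""
    columns = list(product((0, 1), repeat=n))   # the 2^n functions on [n]
    per_class = Counter()
    total = 0
    for cols in product(columns, repeat=k):     # one matrix = k columns
        total += 1
        if len(set(cols)) == k:                 # all columns distinct
            per_class[frozenset(cols)] += 1
    return Fraction(sum(per_class.values()), total), per_class


prob_simple, per_class = simple_matrix_stats(n=2, k=3)
print(prob_simple)                  # compare with 4*3*2 / 2^(2*3)
print(set(per_class.values()))      # each class arises from k! matrices
print(len(per_class), comb(4, 3))   # number of classes of cardinality k
```

Since every class of cardinality $k$ is hit by the same number $k!$ of simple matrices, conditioning on simplicity gives the uniform distribution over such classes, which is the claim of (23).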



Using the distribution $Q^*_{n,k}$ enables simpler computations of the asymptotic probability of
several types of events that are associated with the properties of Theorems 2 - 5. We
henceforth resort to the following process for generating a random class $G$: for every $1 \le
k \le 2^n$ we repeatedly and independently draw matrices of size $n \times k$ using $Q^*_{n,k}$ until we get
a simple matrix $M_{n \times k}$. Then we randomly draw a value for $k$ according to the distribution
of (18) and choose the formerly generated simple matrix corresponding to this chosen $k$.
Since this is a simple matrix then by (23) it is clear that this choice yields a random class $G$
which is distributed uniformly in $F^{(k)}$ according to $P^*_{n,k}$. This is stated formally in Lemma
3 below but first we have an auxiliary lemma that shows the above process converges.

Lemma 2 Let $n = 1, 2, \ldots$ and consider the process of drawing sequences $s_m^{(k)} = \{M_i^{(k)}\}_{i=1}^{m}$,
$1 \le k \le 2^n$, all of length $m$, where the $k$th sequence consists of matrices $M_i^{(k)} \in \mathcal{U}_{n \times k}(\{0,1\})$
which are randomly and independently drawn according to the probability distribution $Q^*_{n,k}$.
Then the probability that after $m = n e^{2^n}$ trials there exists a $k$ such that no simple matrix
appears in $s_m^{(k)}$ converges to zero with increasing $n$.

Proof: Let $a(n,k) = Q^*_{n,k}(S)$ be the probability of getting a simple matrix $M_{n \times k} \in \mathcal{U}_{n \times k}$.
Then the probability that there exists some $k$ such that $s_m^{(k)}$ consists of only non-simple
matrices is

$$q(n,m) \le \sum_{k=1}^{2^n} (1 - a(n,k))^m \le \sum_{k=1}^{2^n} e^{-m a(n,k)}. \quad (24)$$

From (22) we have

$$\ln a(n,k) = \sum_{i=0}^{k-1} \ln(2^n - i) - nk \ln 2 = \sum_{j=2^n-(k-1)}^{2^n} \ln j - nk \ln 2. \quad (25)$$

Since $\ln x$ is an increasing function of $x$ then for any pair of positive integers $2 \le a \le b$ we have

$$\sum_{j=a}^{b} \ln j \ge \int_{a-1}^{b} \ln x \, dx = b(\ln b - 1) - (a-1)(\ln(a-1) - 1).$$

Hence

$$\ln a(n,k) \ge 2^n (n \ln 2 - 1) - (2^n - k)(\ln(2^n - k) - 1) - nk \ln 2 =: b(n,k)$$

and the right side of (24) is now bounded as follows

$$\sum_{k=1}^{2^n} e^{-m a(n,k)} \le \sum_{k=1}^{2^n} e^{-m e^{b(n,k)}}. \quad (26)$$

From a simple check of the derivative of $b(n,k)$ with respect to $k$ it follows that $b(n,k)$ is a
decreasing function of $k$ on $1 \le k \le 2^n$. Replacing each term in the sum on the right side
of (26) by the last term gives the following bound

$$q(n,m) \le e^{n \ln 2 - m e^{-2^n}}. \quad (27)$$

The exponent is negative provided

$$m > n \ln(2) e^{2^n}.$$

Choosing $m = n e^{2^n}$ guarantees that $q(n,m) \to 0$ with increasing $n$. □

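The following result states that the measure $Q^*_{n,k}$ may replace $P^*_{n,k}$ uniformly over $1 \le
k \le 2^n$.

Lemma 3 Let $B \subseteq \mathcal{P}(F)$. Then

$$\max_{1 \le k \le 2^n} \left| P^*_{n,k}(B) - Q^*_{n,k}(B) \right| \to 0$$

as $n$ tends to infinity.

Proof: From (23) we have

$$P^*_{n,k}(B) = Q^*_{n,k}(B \mid S).$$

Then

$$\max_k \left| P^*_{n,k}(B) - Q^*_{n,k}(B) \right| = \max_k \left| \frac{Q^*_{n,k}(B \cap S)}{Q^*_{n,k}(S)} - Q^*_{n,k}(B) \right|$$
$$\le \max_k \left| \frac{1}{Q^*_{n,k}(S)} \right| \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) Q^*_{n,k}(S) \right|$$
$$\le \max_k \left| \frac{1}{Q^*_{n,k}(S)} \right| \left( \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) \right| + \max_k \left| Q^*_{n,k}(B)(1 - Q^*_{n,k}(S)) \right| \right)$$
$$\le \max_k \left| \frac{1}{Q^*_{n,k}(S)} \right| \left( \max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) \right| + \max_k \left| Q^*_{n,k}(B) \right| \max_k \left| 1 - Q^*_{n,k}(S) \right| \right). \quad (28)$$

From Lemma 2 it follows that

$$\max_k \left| 1 - Q^*_{n,k}(S) \right| \to 0, \quad \max_k \left| 1/Q^*_{n,k}(S) \right| \to 1 \quad (29)$$

with increasing $n$. For any $1 \le k \le 2^n$,

$$Q^*_{n,k}(B) + Q^*_{n,k}(S) - 1 \le Q^*_{n,k}(B \cap S) \le Q^*_{n,k}(B)$$

and by Lemma 2, $Q^*_{n,k}(S)$ tends to 1 uniformly over $1 \le k \le 2^n$ with increasing $n$. Hence

$$\max_k \left| Q^*_{n,k}(B \cap S) - Q^*_{n,k}(B) \right| \to 0$$

which together with (28) and (29) implies the statement of the lemma. □

We now proceed to the proofs of the theorems in Section 7.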
10.2 Proofs 

Note that for any property $\mathcal{M}$, the quantity $\omega_x(k)$ is the ratio of the number of
classes $G \in F^{(k)}$ that satisfy $\mathcal{M}$ to the total number of classes that satisfy $\mathcal{M}$. It is
therefore equal to $P_n(|\mathcal{F}| = k \mid \mathcal{F} \models \mathcal{M})$. Our approach starts by computing the probability
$P_n(\mathcal{F} \models \mathcal{M} \mid |\mathcal{F}| = k)$ from which $P_n(|\mathcal{F}| = k \mid \mathcal{F} \models \mathcal{M})$ and then $\omega_x(k)$ are obtained.

10.2.1 Proof of Theorem 2

We start with an auxiliary lemma which states that the
probability $P_n(\mathcal{F} \models \mathcal{L}_d \mid |\mathcal{F}| = k)$ possesses a zero-one behavior.

Lemma 4 Let $\mathcal{F}$ be a class of cardinality $k_n$ and randomly drawn according to the uniform
probability distribution $P^*_{n,k_n}$ on $F^{(k_n)}$. Then as $n$ increases, the probability $P^*_{n,k_n}(\mathcal{F} \models \mathcal{L}_d)$
that $\mathcal{F}$ satisfies property $\mathcal{L}_d$ tends to 0 or 1 if $k_n \gg \log(2n/d)$ or $k_n = 1 + \kappa_n$, $\kappa_n \ll
(\log(n))/d$, respectively.

Proof: For brevity, we sometimes write $k$ for $k_n$. Using Lemma 3 it suffices to show that
$Q^*_{n,k}(\mathcal{F} \models \mathcal{L}_d)$ tends to 0 or 1 under the stated conditions. For any set $S \subseteq [n]$, $|S| = d$
and any fixed $v \in \{0,1\}^d$, under the probability distribution $Q^*_{n,k}$, the event $E_v$ that every
function $f \in \mathcal{F}$ satisfies $f|_S = v$ has a probability $(1/2)^{kd}$. Denote by $E_S$ the event that
all functions in the random class $\mathcal{F}$ have the same restriction on $S$. There are $2^d$ possible
restrictions on $S$ and the events $E_v$, $v \in \{0,1\}^d$, are disjoint. Hence $Q^*_{n,k}(E_S) = 2^d (1/2)^{kd} =
2^{-(k-1)d}$. The event that $\mathcal{F}$ has property $\mathcal{L}_d$, i.e., that $L(\mathcal{F}) \ge d$, equals the union of $E_S$,
over all $S \subseteq [n]$ of cardinality $d$. Thus we have

$$Q^*_{n,k}(\mathcal{F} \models \mathcal{L}_d) = Q^*_{n,k}\left( \bigcup_{S \subseteq [n] : |S| = d} E_S \right) \le 2^{-(k-1)d} \, \frac{n^d (1 - o(1))}{d!}.$$

For $k = k_n \gg \log(2n/d)$ the right side tends to zero which proves the first statement. Let
the mutually disjoint sets $S_i = \{id + 1, id + 2, \ldots, d(i+1)\} \subseteq [n]$, $0 \le i \le m - 1$ where
$m = \lfloor n/d \rfloor$. The event that $\mathcal{L}_d$ is not true equals $\bigcap_{S : |S| = d} \bar{E}_S$. Its probability is

$$Q^*_{n,k}\left( \bigcap_{S : |S| = d} \bar{E}_S \right) = 1 - Q^*_{n,k}\left( \bigcup_{S : |S| = d} E_S \right) \le \max\left\{0,\ 1 - Q^*_{n,k}\left( \bigcup_{i=0}^{m-1} E_{S_i} \right)\right\}.$$

Since the sets are disjoint and of the same size $d$ then the right hand side equals $\max\{0, 1 -
m Q^*_{n,k}(E_{[d]})\}$. This equals

$$\max\left\{0,\ 1 - \left\lfloor \frac{n}{d} \right\rfloor 2^{-(k-1)d}\right\}$$

which tends to zero when $k = k_n = 1 + \kappa_n$, $\kappa_n \ll (\log(n))/d$. The second statement is
proved. □
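Two ingredients of the proof above can be checked exactly for tiny parameters: the probability $2^{-(k-1)d}$ that all $k$ columns of a random matrix agree on a fixed set of $d$ rows, and the union bound over all $\binom{n}{d}$ such sets. The sketch below enumerates every $n \times k$ binary matrix under the fair-coin measure; it is an illustration with hypothetical helper names (`agree_on`, `lemma4_probabilities`) and rows indexed from 0.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def agree_on(cols, S):
    """True if all columns have the same restriction on the row set S."""
    return len({tuple(c[i] for i in S) for c in cols}) == 1


def lemma4_probabilities(n, k, d):
    """Exact probabilities, under uniform n x k binary matrices, that the
    columns agree on the fixed set {0..d-1} and on at least one d-subset."""
    columns = list(product((0, 1), repeat=n))
    fixed_set = tuple(range(d))
    total = hit_fixed = hit_any = 0
    for cols in product(columns, repeat=k):
        total += 1
        hit_fixed += agree_on(cols, fixed_set)
        hit_any += any(agree_on(cols, S)
                       for S in combinations(range(n), d))
    return Fraction(hit_fixed, total), Fraction(hit_any, total)


n, k, d = 3, 2, 2
p_fixed, p_any = lemma4_probabilities(n, k, d)
print(p_fixed, Fraction(1, 2 ** ((k - 1) * d)))   # single-set probability
print(p_any, comb(n, d) * p_fixed)                # union versus union bound
```

The enumeration grows as $2^{nk}$, so it is only feasible for very small $n$ and $k$; it does not replace the asymptotic argument.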

Remark 12 While from this result it is clear that the critical value of $k$ for the conditional
probability $P_n(\mathcal{L}_d|k)$ to tend to 1 is $O(\log(n))$, as will be shown below, when considering the
conditional probability $P_n(k|\mathcal{L}_d)$, the most probable value of $k$ is much higher at $O(2^{n-d})$.

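We continue now with the proof of Theorem 2. For any probability measure $P$ on $\mathcal{P}(F)$
denote by $P(k|\mathcal{L}_d) = P(|\mathcal{F}| = k \mid \mathcal{F} \models \mathcal{L}_d)$. By the premise of Theorem 2 the input $x$
describes the target $t$ as an element of a class that satisfies property $\mathcal{L}_d$. In this case the
quantity $\omega_x(k)$ is the ratio of the number of classes of cardinality $k$ that satisfy $\mathcal{L}_d$ to the
total number of classes that satisfy $\mathcal{L}_d$. Since by (17) the probability distribution $P_n$ is
uniform over the space $\mathcal{P}(F)$ whose size is $2^{2^n}$ then

$$\omega_x(k) = \frac{P_n(|\mathcal{F}| = k,\ \mathcal{F} \models \mathcal{L}_d)\, 2^{2^n}}{P_n(\mathcal{L}_d)\, 2^{2^n}} = P_n(k|\mathcal{L}_d). \quad (30)$$

We have

$$P_n(k|\mathcal{L}_d) = \frac{P_n(\mathcal{L}_d|k) P_n(k)}{\sum_{j=1}^{2^n} P_n(\mathcal{L}_d|j) P_n(j)}.$$

By (21), it follows therefore that the sum in (6) equals

$$\sum_{k=2}^{2^n} \omega_x(k) \log k = \frac{\sum_{k=2}^{2^n} P^*_{n,k}(\mathcal{L}_d) P_n(k) \log k}{\sum_{j=1}^{2^n} P^*_{n,j}(\mathcal{L}_d) P_n(j)}. \quad (31)$$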



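Let $N = 2^n$, then by Lemma 3 and from the proof of Lemma 4, as $n$ (hence $N$) increases,
it follows that

$$P^*_{n,k}(\mathcal{L}_d) \approx A(N,d) \left(\frac{1}{2}\right)^{d(k-1)} \quad (32)$$

where $A(N,d)$ satisfies

$$\frac{\log N}{d} \le A(N,d) \le \frac{\log^d N}{d!}.$$

Let $p = 1/(1 + 2^d)$ then using (32) the ratio in (31) is

$$\frac{\sum_{k=2}^{N} \binom{N}{k} p^k (1-p)^{N-k} \log k}{\sum_{j=1}^{N} \binom{N}{j} p^j (1-p)^{N-j}}. \quad (33)$$

Substituting for $N$ and $p$, the denominator equals

$$1 - (1-p)^N = 1 - \left(1 - \frac{1}{1 + 2^d}\right)^{2^n}. \quad (34)$$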



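Using the DeMoivre-Laplace limit theorem [9], the binomial distribution $P_{N,p}(k)$ with
parameters $N$ and $p$ satisfies

$$P_{N,p}(k) \approx \frac{1}{\sigma} \phi\left(\frac{k - \mu}{\sigma}\right), \quad N \to \infty$$

where $\phi(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)$ is the standard normal probability density function and
$\mu = Np$, $\sigma = \sqrt{Np(1-p)}$. The sum in the numerator of (33) may be approximated by an
integral

$$\frac{1}{\sigma} \int_{2}^{\infty} \phi\left(\frac{x - \mu}{\sigma}\right) \log x \, dx = \int_{(2-\mu)/\sigma}^{\infty} \phi(x) \log(\sigma x + \mu) \, dx.$$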



The log factor equals $\log \mu + \log(1 + x\sigma/\mu) = \log \mu + x\sigma/\mu + O(x^2 (\sigma/\mu)^2)$. Denote by
$a = (2 - \mu)/\sigma$; then the right side above equals

$$\Phi(-a) \log \mu + \frac{\sigma}{\mu} \phi(a) + O\left(\frac{\sigma^2}{\mu^2}\right)$$

where $\Phi(x)$ is the normal cumulative probability distribution. Substituting for $\mu$, $\sigma$, $p$ and
$N$, and combining with (34), then (31) is asymptotically equal to

$$\sum_{k=2}^{2^n} \omega_x(k) \log k \approx \frac{\Phi(-a) \log\left(\frac{2^n}{1+2^d}\right) + 2^{-(n-d)/2} \phi(a) + O(2^{-(n-d)})}{1 - \left(\frac{2^d}{1+2^d}\right)^{2^n}} \quad (35)$$

where

$$a = 2(1 + 2^d) 2^{-(n+d)/2} - 2^{(n-d)/2}.$$

In Theorem 2 the set $Y$ is the class $F$ (see (6)) hence $\log|F| = n$ and

$$I(x : F) = n - \sum_{k \ge 2} \omega_x(k) \log k.$$

Combining with (35) the first statement of Theorem 2 follows.

We now compute the description complexity $\ell(x_{\mathcal{L}_d})$. Since in this setting $Y$ is $F$ and $Z$ is
$\mathcal{P}(F)$ then, by (9), the description complexity $\ell(x_{\mathcal{L}_d})$ is $2^n - \log|\mathcal{L}_d|$. Since the probability
distribution $P_n$ is uniform on $\mathcal{P}(F)$ then the cardinality of $\mathcal{L}_d$ equals

$$|\mathcal{L}_d| = 2^{2^n} P_n(\mathcal{L}_d).$$

It follows that

$$\ell(x_{\mathcal{L}_d}) = -\log P_n(\mathcal{L}_d)$$

hence it suffices to compute $P_n(\mathcal{L}_d)$. Letting $N = 2^n$, we have

$$P_n(\mathcal{L}_d) = \sum_{k=1}^{N} P_n(\mathcal{L}_d|k) P_N(k) = \sum_{k=1}^{N} P^*_{n,k}(\mathcal{L}_d) P_N(k).$$

Using (32) and letting $p = 1/(1 + 2^d)$ this becomes

$$A(N,d) \sum_{k=1}^{N} \left(\frac{1}{2}\right)^{d(k-1)} \binom{N}{k} \left(\frac{1}{2}\right)^{N} = A(N,d)\, 2^d \left(\frac{1}{2(1-p)}\right)^{N} \left(1 - (1-p)^N\right).$$

Letting $q = (1-p)^N$, it follows that

$$-\log P_n(\mathcal{L}_d) \approx N - d - \log A(N,d) + q + O(q^2) + \log(q)$$
$$= N(1 - p - O(p^2)) - d - c \log\log N + o(1)$$

where $1 \le c \le d$. Substituting for $N$ gives the result.



10.2.2 Proof of Theorem 3

We start with an auxiliary lemma that states a threshold
value for the cardinality of a random element of $F^{(k)}$ that satisfies property $\mathcal{V}_d^c$.

Lemma 5 For any integer $d > 0$ let $k$ be an integer satisfying $k \ge 2^d$. Let $\mathcal{F}$ be a class of
cardinality $k$ and randomly drawn according to the uniform probability distribution $P^*_{n,k}$ on
$F^{(k)}$. Then

$$\lim_{n \to \infty} P^*_{n,k}(\mathcal{V}_d^c) = 1.$$

Remark 13 When $k_n < 2^d$ there does not exist an $E \subseteq [n]$ with $|\mathrm{tr}_{\mathcal{F}}(E)| = 2^d$ hence
$P^*_{n,k_n}(\mathcal{V}_d^c) = 0$. For $k_n \gg 2^d$, $P^*_{n,k_n}(\mathcal{V}_d^c)$ tends to 1. Hence, for a random class $\mathcal{F}$ to have
property $\mathcal{V}_d^c$ the critical value of its cardinality is $2^d$.

We proceed now with the proof of Lemma 5.

Proof: It suffices to prove the result for $k = 2^d$ since $P^*_{n,k}(\mathcal{F} \models \mathcal{V}_d^c) \ge P^*_{n,2^d}(\mathcal{F} \models \mathcal{V}_d^c)$. As in
the proof of Lemma 4 we represent $P^*_{n,2^d}$ by $Q^*_{n,2^d}$ using (23) and with Lemma 3 it suffices
to show that $Q^*_{n,2^d}(\mathcal{F} \models \mathcal{V}_d^c)$ tends to 1. Denote by $U_d$ the 'complete' matrix with $d$ rows and
$2^d$ columns formed by all $2^d$ binary vectors of length $d$, ranked for instance in alphabetical
order. The event "$\mathcal{F} \models \mathcal{V}_d^c$" occurs if there exists a subset $S = \{i_1, \ldots, i_d\} \subseteq [n]$ such
that the submatrix whose rows are indexed by $S$ and columns by $[2^d]$ is equal to $U_d$. Let
$S_i = \{id + 1, id + 2, \ldots, d(i+1)\}$, $0 \le i \le m - 1$, be the sets defined in the proof of Lemma
4 and consider the $m$ corresponding events which are defined as follows: the $i$th event is
described as having a submatrix whose rows are indexed by $S_i$ and is equal to $U_d$. Since
the sets $S_i$, $0 \le i \le m - 1$, are disjoint it is clear that these events are independent and have
the same probability

$$Q^*_{n,k}(S_i) = 2^{-d 2^d}.$$

Hence the probability that at least one of them is fulfilled is

$$1 - \left(1 - 2^{-d 2^d}\right)^{\lfloor n/d \rfloor}$$

which tends to 1 as $n$ increases. □
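To get a feel for how slowly the lower bound in the proof above approaches 1, the expression $1 - (1 - 2^{-d 2^d})^{\lfloor n/d \rfloor}$ can be evaluated directly. The sketch below only evaluates that expression; the function name `shatter_lower_bound` and the sample values of $n$ and $d$ are illustrative assumptions.

```python
def shatter_lower_bound(n, d):
    """Probability that at least one of the floor(n/d) disjoint row blocks
    reproduces the complete d x 2^d matrix: 1 - (1 - 2^(-d*2^d))^floor(n/d)."""
    beta = 2.0 ** (-d * 2 ** d)     # probability for a single block
    blocks = n // d                 # number of disjoint blocks of d rows
    return 1.0 - (1.0 - beta) ** blocks


for d in (1, 2):
    for n in (10, 100, 1000, 10000):
        print(d, n, shatter_lower_bound(n, d))
```

The single-block probability $2^{-d 2^d}$ decays doubly exponentially in $d$, so for fixed $n$ the bound is informative only for small $d$; the limit statement of the lemma concerns $n$ growing with $d$ fixed.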



We continue with the proof of Theorem [3] As in the proof of Theorem [2] since by (17) 
the probability distribution P n is uniform over V(F) then 

Considering Remark |13[ in this case the sum in (|6| is 

* Pl k {Vj)P n {k)\ogk 

We now obtain its asymptotic value as n increases. From the proof of Lemma [5j it follows 
that for all k > 2 d , 

PIM) « 1 - (1 - PY k , P = 2~ d2d ,r= Jj. 

Since $\beta$ is an exponentially small positive real we approximate $(1 - \beta)^{rk}$ by $1 - rk\beta$ (by assumption, $n < d2^d$, hence this remains positive for all $1 \le k \le 2^n$). Therefore we take

$$P^*_{n,k}(V_d^c) \approx rk\beta \qquad (37)$$

and (36) is approximated by

$$\frac{\sum_{k=2^d}^{2^n} k P_n(k) \log k}{\sum_{j=2^d}^{2^n} j P_n(j)}. \qquad (38)$$

As before, for simpler notation let us denote $N = 2^n$ and let $P_N(k)$ be the binomial distribution with parameters $N$ and $p = 1/2$. Denote by $\mu = N/2$ and $\sigma = \sqrt{N/4}$, then using the DeMoivre-Laplace limit theorem we have



$$P_N(k) \approx \frac{1}{\sigma}\, \phi\!\left(\frac{k - \mu}{\sigma}\right), \qquad N \to \infty.$$
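The quality of the DeMoivre-Laplace approximation used here can be checked directly for moderate $N$. The sketch below compares the exact Binomial($N$, 1/2) mass with $\frac{1}{\sigma}\phi((k-\mu)/\sigma)$, taking $\phi$ to be the standard normal density. The function names are hypothetical, chosen only for this illustration.

```python
import math


def binom_half_pmf(N: int, k: int) -> float:
    """Exact Binomial(N, 1/2) mass at k (exact integer arithmetic, then one division)."""
    return math.comb(N, k) / 2 ** N


def normal_approx(N: int, k: int) -> float:
    """DeMoivre-Laplace approximation (1/sigma) * phi((k - mu) / sigma)."""
    mu, sigma = N / 2, math.sqrt(N / 4)
    x = (k - mu) / sigma
    return math.exp(-x * x / 2) / (sigma * math.sqrt(2 * math.pi))


if __name__ == "__main__":
    N = 2 ** 10
    for k in (N // 2, N // 2 + 16, N // 2 + 32):
        print(k, binom_half_pmf(N, k), normal_approx(N, k))
```

With $N = 2^{10}$ the two columns already agree closely near the mean, which is the region that dominates the sums in (38).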



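Thus (38) is approximated by the ratio of two integrals

$$\frac{\int_{(2^d - \mu)/\sigma}^{\infty} \phi(x)(\sigma x + \mu) \log(\sigma x + \mu)\, dx}{\int_{(2^d - \mu)/\sigma}^{\infty} \phi(x)(\sigma x + \mu)\, dx}. \qquad (39)$$

The log factor equals $\log \mu + \log(1 + x\sigma/\mu) = \log \mu + x\sigma/\mu + O(x^2(\sigma/\mu)^2)$. Denote by

$$a = (\mu - 2^d)/\sigma \qquad (40)$$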
then the numerator is approximated by

$$\Phi(a)\mu \log \mu + \sigma(1 + \log \mu)\phi(a) + O\!\left(\sigma^2/\mu\right) \approx \log(N/2) \left( \Phi(a)\frac{N}{2} + \left(1 + \frac{1}{\log(N/2)}\right) \phi(a) \frac{\sqrt{N}}{2} \right).$$

Similarly, the denominator of (39) is approximated by $\Phi(a)N/2 + \phi(a)\sqrt{N}/2$. The ratio, and hence (36), tends to

$$\log(N/2)\, \frac{\Phi(a)N/2 + \left(1 + \frac{1}{\log(N/2)}\right)\phi(a)\sqrt{N}/2}{\Phi(a)N/2 + \phi(a)\sqrt{N}/2}.$$

Substituting back for $a$, the above tends to $\log(N/2) = \log N - 1$. With $N = 2^n$ and
^ the first statement of the theorem follows. 
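The limit $\log(N/2) = n - 1$ can be compared against a direct evaluation of the ratio (38) for small $n$. The sketch below assumes base-2 logarithms, consistent with $\log(N/2) = \log N - 1$, and takes $P_N$ to be Binomial($N$, 1/2). The function name `size_biased_mean_log` is ours.

```python
import math


def size_biased_mean_log(n: int, d: int) -> float:
    """Evaluate sum_k k P_N(k) log2(k) / sum_j j P_N(j) over 2^d <= k <= N = 2^n."""
    N = 2 ** n
    num = 0.0
    den = 0.0
    for k in range(2 ** d, N + 1):
        # Weight k * P_N(k) with P_N = Binomial(N, 1/2), computed exactly then rounded.
        w = k * (math.comb(N, k) / 2 ** N)
        num += w * math.log2(k)
        den += w
    return num / den


if __name__ == "__main__":
    for n in (6, 8, 10):
        print(n, size_biased_mean_log(n, d=3), n - 1)
```

The loop is linear in $N = 2^n$, so this is only practical for small $n$. It is meant as a sanity check of the asymptotic statement, not as a way of computing it.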

We now compute the description complexity $\ell(x_{V_d^c})$. Following the steps of the second part of the proof of Theorem 2 (Section 10.2.1) we have $\ell(x_{V_d^c}) = -\log P_n(V_d^c)$. Using (37) the probability is approximated by

$$P_n(V_d^c) \approx r\beta \sum_{k=2^d}^{N} k P_N(k)$$

and as before, this is approximated by $r\beta(\Phi(a)N + \phi(a)\sqrt{N})/2$. Thus substituting for $r$ and $\beta$ we have

$$-\log P_n(V_d^c) \approx d(2^d + 1) + \log(d) - \log\left(\Phi(a)N + \phi(a)\sqrt{N}\right) - \log\log N + 1.$$

Substituting for $N$ yields the result.
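The substitution of $r = n/(d2^d)$ and $\beta = 2^{-d2^d}$ can be verified numerically. The sketch below compares the closed form above with $-\log_2\left(r\beta\sum_{k \ge 2^d} k P_N(k)\right)$, where the sum is evaluated exactly. It assumes base-2 logarithms and that $\Phi$, $\phi$ are the standard normal distribution function and density. All function names are hypothetical.

```python
import math


def std_normal_pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def complexity_closed_form(n: int, d: int) -> float:
    """d(2^d + 1) + log d - log(Phi(a) N + phi(a) sqrt(N)) - log log N + 1, logs base 2."""
    N = 2.0 ** n
    mu, sigma = N / 2, math.sqrt(N) / 2
    a = (mu - 2 ** d) / sigma
    tail = std_normal_cdf(a) * N + std_normal_pdf(a) * math.sqrt(N)
    return d * (2 ** d + 1) + math.log2(d) - math.log2(tail) - math.log2(math.log2(N)) + 1


def complexity_direct(n: int, d: int) -> float:
    """-log2( r * beta * sum_{k >= 2^d} k P_N(k) ) with the binomial sum computed exactly."""
    N = 2 ** n
    r = n / (d * 2 ** d)
    log2_beta = -d * 2 ** d
    # sum_k k * C(N, k) as an exact integer; dividing by 2^N happens in log space.
    total = sum(k * math.comb(N, k) for k in range(2 ** d, N + 1))
    log2_sum = math.log2(total) - N
    return -(math.log2(r) + log2_beta + log2_sum)


if __name__ == "__main__":
    print(complexity_closed_form(10, 3), complexity_direct(10, 3))
```

Both functions rely on the linearisation (37), so their agreement confirms the algebra of the substitution and the normal approximation of the binomial sum, not the accuracy of (37) itself.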
10.2.3 Proof of Theorem 4 The proof is almost identical to that of Theorem 3. From (37) we have

$$P^*_{n,k}(V_d) \approx \begin{cases} 1, & 1 \le k < 2^d \\ 1 - r\beta k, & 2^d \le k \le 2^n \end{cases}$$

hence

$$\frac{\sum_{k=2}^{2^n} P^*_{n,k}(V_d) P_n(k) \log k}{\sum_{j=1}^{2^n} P^*_{n,j}(V_d) P_n(j)} \approx \frac{\sum_{k=2}^{2^n} P_n(k) \log k - r\beta \sum_{k=2^d}^{2^n} k P_n(k) \log k}{\sum_{j=1}^{2^n} P_n(j) - r\beta \sum_{j=2^d}^{2^n} j P_n(j)}. \qquad (41)$$



Let $a$ be as in (40) and denote by $b = (\mu - 2)/\sigma$ and

$$s = \Phi(a)N/2 + \phi(a)\sqrt{N}/2$$

then the numerator tends to $\log(N/2)(\Phi(b) - rs\beta) + \phi(b)/\sqrt{N}$ and the denominator tends to $1 - (1/2)^N - rs\beta$. Then (41) tends to

$$\log(N/2)\, \frac{\Phi(b) - rs\beta}{1 - (1/2)^N - rs\beta} + \frac{\phi(b)}{\sqrt{N}\left(1 - (1/2)^N - rs\beta\right)} \approx \log(N/2) + \frac{\phi(b)}{\sqrt{N}(1 - rs\beta)}.$$

Substituting for $r$, $\beta$ and $N$ yields the statement of the theorem.



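10.2.4 Proof of Theorem 5 The probability that a random class of cardinality $k$ satisfies the property $V_d(S_m)$ is

$$P_n(\mathcal{F} \models V_d(S_m) \mid |\mathcal{F}| = k) = P_n(\mathcal{F} \models V_d,\ \mathcal{F}_{|\xi} = \zeta \mid |\mathcal{F}| = k) = P_n(\mathcal{F} \models V_d \mid \mathcal{F}_{|\xi} = \zeta,\ |\mathcal{F}| = k)\, P_n(\mathcal{F}_{|\xi} = \zeta \mid |\mathcal{F}| = k). \qquad (42)$$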



The factor on the right of (42) is the probability of the condition $E_\zeta$ that a random class $\mathcal{F}$ of size $k$ has for all its elements the same restriction $\zeta$ on the sample $\xi$. As in the proof of Lemma 4 it suffices to use the probability distribution $Q^*_{n,k}$, in which case $Q^*_{n,k}(E_\zeta)$ is $\gamma^k$ where $\gamma = (1/2)^m$. The left factor of (42) is the probability that a random class $\mathcal{F}$ with cardinality $k$ which satisfies $E_\zeta$ will satisfy property $V_d$. This is the same as the event that a random class $\mathcal{F}$ on $[n] \setminus S_m$ satisfies property $V_d$. Its probability is $P^*_{n-m,k}(V_d)$, which equals 1 for $k < 2^d$, and using (37) for $k \ge 2^d$ it is approximated as

$$1 - P^*_{n-m,k}(V_d^c) \approx 1 - rk\beta$$

where $r = (n - m)/(d2^d)$ and $\beta = 2^{-d2^d}$. Hence the conditional entropy becomes

$$\frac{\sum_{k=2}^{2^n} \gamma^k P^*_{n-m,k}(V_d) P_n(k) \log k}{\sum_{j=1}^{2^n} \gamma^j P^*_{n-m,j}(V_d) P_n(j)} \approx \frac{\sum_{k=2}^{2^n} \gamma^k P_n(k) \log k - r\beta \sum_{k=2^d}^{2^n} k \gamma^k P_n(k) \log k}{\sum_{j=1}^{2^n} \gamma^j P_n(j) - r\beta \sum_{j=2^d}^{2^n} j \gamma^j P_n(j)}. \qquad (43)$$

Let $p = \gamma/(1 + \gamma)$, $N = 2^n$ and denote by $P_{N,p}(k)$ the binomial distribution with parameters $N$ and $p$. Then (43) becomes

$$\frac{\sum_{k=2}^{N} P_{N,p}(k) \log k - r\beta \sum_{k=2^d}^{N} k P_{N,p}(k) \log k}{\sum_{j=1}^{N} P_{N,p}(j) - r\beta \sum_{j=2^d}^{N} j P_{N,p}(j)}. \qquad (44)$$

With $\mu = Np$ and $\sigma = \sqrt{Np(1 - p)}$ let $a = (\mu - 2^d)/\sigma$, $b = (\mu - 2)/\sigma$ and

$$s = \Phi(a)Np + \phi(a)\sqrt{Np(1 - p)}$$

then the numerator tends to

$$\log(Np)\left(\Phi(b) - rs\beta\right) + \phi(b)\sqrt{\frac{1 - p}{Np}}$$

and the denominator tends to $1 - (2^m/(2^m + 1))^N - rs\beta$. Therefore (44) tends to

$$\log(Np) + \frac{\phi(b)}{1 - o(1) - rs\beta} \sqrt{\frac{1 - p}{Np}}.$$

Substituting for $r$, $s$, $\beta$ and $N$ yields the first statement of the theorem.
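The passage from (43) to (44) rests on the identity $\gamma^k P_n(k) = (2(1-p))^{-N} P_{N,p}(k)$ with $p = \gamma/(1+\gamma)$, which lets the common factor cancel between numerator and denominator. The sketch below checks this identity in exact rational arithmetic for small parameters, taking $P_n$ to be Binomial($N$, 1/2) as in the proof of Theorem 3. The function name `tilted_weights` is our own.

```python
from fractions import Fraction
from math import comb


def tilted_weights(n: int, m: int):
    """Return (lhs, rhs) lists indexed by k = 0..N, with N = 2^n and gamma = 2^(-m).

    lhs[k] = gamma^k * P_n(k), where P_n is Binomial(N, 1/2)
    rhs[k] = (2(1 - p))^(-N) * P_{N,p}(k), where p = gamma / (1 + gamma)
    """
    N = 2 ** n
    gamma = Fraction(1, 2 ** m)
    p = gamma / (1 + gamma)
    factor = (2 * (1 - p)) ** (-N)
    lhs = [gamma ** k * Fraction(comb(N, k), 2 ** N) for k in range(N + 1)]
    rhs = [factor * comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(N + 1)]
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = tilted_weights(n=3, m=2)
    print(lhs == rhs)
```

Because the factor $(2(1-p))^{-N}$ does not depend on $k$, it drops out of the ratio (44) and reappears only when the denominator is used on its own to obtain the description complexity.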
Next, we obtain the description complexity. We have

$$\ell(x_{V_d(S_m)}) = -\log P_n(V_d(S_m)).$$

The probability $P_n(V_d(S_m))$ is the denominator of (43), which equals the denominator of (44) multiplied by a factor of $(2(1 - p))^{-N}$, hence from above

$$-\log P_n(V_d(S_m)) \approx -\log\left(1 - \left(\frac{2^m}{1 + 2^m}\right)^N - r\beta s\right) + N + N\log(1 - p).$$

Let $q = \left(\frac{2^m}{1 + 2^m}\right)^N + r\beta s$, then we have as an estimate

$$\ell(x_{V_d(S_m)}) \approx \log\frac{1}{1 - q} + N(1 + \log(1 - p))$$

from which the second statement of the theorem follows. ■

Figure 1: (a) $I^*(\ell)$, (b) $k^*(\ell)$
Figure 2: Information $I(x_{m(G)} : F)$ for (a) $|G| = \sqrt{n}$, (b) $|G| = n$ and (c) $|G| =$
Figure 3: Efficiency $\eta_F(x_{m(G)})$ for (a) $|G| = \sqrt{n}$, (b) $|G| = n$ and (c) $|G| =$ 2 r
Figure 4: Efficiency $\eta_F(x)$ for (a) $x_{L_d}$, (b) $x_{V_d^c}$ and (c) $x_{V_d}$, $d = \sqrt{n}$
Figure 5: Efficiency $\eta_F(x_{V_d(S_m)})$, with $m = n^a$, $a = 0.01, 0.1, 0.5, 0.95$, $d =$
Figure 6: Efficiency $\eta_F(x_{V_d(S_m)})$, with $n = 10$, $m(n) = n^a$, $d$